The SARS-CoV-2 virus, which causes coronavirus disease 2019 (Covid-19), was first reported in 2019, and the resulting outbreak was declared a pandemic in March 2020. For two years, the world has battled the disease with traditional public health measures such as quarantine and social distancing, as well as innovative technologies such as rapid testing and breakthrough vaccines.
To predict the probability of a positive or negative Covid-19 test result, and to identify the factors that influence this result, using a collection of laboratory tests from suspected cases.
In areas with overwhelmed healthcare systems, it is impractical to test every patient who presents with flu-like symptoms. It is therefore important to establish targeted testing criteria that are effective in identifying positive Covid-19 cases.
a) This will enable fair allocation of resources in the management of Covid-19 cases.
b) Developing effective criteria will enable rapid detection of cases and reduce the disease burden by allowing management of positive cases to begin quickly.
c) This algorithm could reduce hospital wait times and shorten queues in the waiting rooms and testing centers.
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To perform statistical analysis
import scipy.stats as stats
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    plot_confusion_matrix,  # deprecated; removed in scikit-learn 1.2 (use ConfusionMatrixDisplay there)
    make_scorer,
    precision_recall_curve,
    roc_curve,
    roc_auc_score,
)
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To suppress scientific notation in a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To undersample and oversample the data (the package is published as imbalanced-learn)
!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Load the dataset and work on a copy so the original stays untouched
data = pd.read_excel('covid19_dataset.xlsx')
df = data.copy()
df.head()
| Patient ID | Patient age quantile | SARS-Cov-2 exam result | Patient addmited to regular ward (1=yes, 0=no) | Patient addmited to semi-intensive unit (1=yes, 0=no) | Patient addmited to intensive care unit (1=yes, 0=no) | Hematocrit | Hemoglobin | Platelets | Mean platelet volume | Red blood Cells | Lymphocytes | Mean corpuscular hemoglobin concentration (MCHC) | Leukocytes | Basophils | Mean corpuscular hemoglobin (MCH) | Eosinophils | Mean corpuscular volume (MCV) | Monocytes | Red blood cell distribution width (RDW) | Serum Glucose | Respiratory Syncytial Virus | Influenza A | Influenza B | Parainfluenza 1 | CoronavirusNL63 | Rhinovirus/Enterovirus | Mycoplasma pneumoniae | Coronavirus HKU1 | Parainfluenza 3 | Chlamydophila pneumoniae | Adenovirus | Parainfluenza 4 | Coronavirus229E | CoronavirusOC43 | Inf A H1N1 2009 | Bordetella pertussis | Metapneumovirus | Parainfluenza 2 | Neutrophils | Urea | Proteina C reativa mg/dL | Creatinine | Potassium | Sodium | Influenza B, rapid test | Influenza A, rapid test | Alanine transaminase | Aspartate transaminase | Gamma-glutamyltransferase | Total Bilirubin | Direct Bilirubin | Indirect Bilirubin | Alkaline phosphatase | Ionized calcium | Strepto A | Magnesium | pCO2 (venous blood gas analysis) | Hb saturation (venous blood gas analysis) | Base excess (venous blood gas analysis) | pO2 (venous blood gas analysis) | Fio2 (venous blood gas analysis) | Total CO2 (venous blood gas analysis) | pH (venous blood gas analysis) | HCO3 (venous blood gas analysis) | Rods # | Segmented | Promyelocytes | Metamyelocytes | Myelocytes | Myeloblasts | Urine - Esterase | Urine - Aspect | Urine - pH | Urine - Hemoglobin | Urine - Bile pigments | Urine - Ketone Bodies | Urine - Nitrite | Urine - Density | Urine - Urobilinogen | Urine - Protein | Urine - Sugar | Urine - Leukocytes | Urine - Crystals | Urine - Red blood cells | Urine - Hyaline cylinders | Urine - Granular cylinders | Urine - Yeasts | Urine - Color | Partial thromboplastin time (PTT) | Relationship (Patient/Normal) | International normalized ratio (INR) | Lactic Dehydrogenase | Prothrombin time (PT), Activity | Vitamin B12 | Creatine phosphokinase (CPK) | Ferritin | Arterial Lactic Acid | Lipase dosage | D-Dimer | Albumin | Hb saturation (arterial blood gases) | pCO2 (arterial blood gas analysis) | Base excess (arterial blood gas analysis) | pH (arterial blood gas analysis) | Total CO2 (arterial blood gas analysis) | HCO3 (arterial blood gas analysis) | pO2 (arterial blood gas analysis) | Arteiral Fio2 | Phosphor | ctO2 (arterial blood gas analysis) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44477f75e8169d2 | 13 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 126e9dd13932f68 | 17 | negative | 0 | 0 | 0 | 0.237 | -0.022 | -0.517 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.224 | -0.292 | 1.482 | 0.166 | 0.358 | -0.625 | -0.141 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | -0.619 | 1.198 | -0.148 | 2.090 | -0.306 | 0.863 | negative | negative | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | a46b4402a0e5696 | 8 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | f7d619a94f97c45 | 5 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | d9e41465789c2b5 | 15 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | detected | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# viewing a random sample of the dataset
df.sample(n=10, random_state=1)
| Patient ID | Patient age quantile | SARS-Cov-2 exam result | Patient addmited to regular ward (1=yes, 0=no) | Patient addmited to semi-intensive unit (1=yes, 0=no) | Patient addmited to intensive care unit (1=yes, 0=no) | Hematocrit | Hemoglobin | Platelets | Mean platelet volume | Red blood Cells | Lymphocytes | Mean corpuscular hemoglobin concentration (MCHC) | Leukocytes | Basophils | Mean corpuscular hemoglobin (MCH) | Eosinophils | Mean corpuscular volume (MCV) | Monocytes | Red blood cell distribution width (RDW) | Serum Glucose | Respiratory Syncytial Virus | Influenza A | Influenza B | Parainfluenza 1 | CoronavirusNL63 | Rhinovirus/Enterovirus | Mycoplasma pneumoniae | Coronavirus HKU1 | Parainfluenza 3 | Chlamydophila pneumoniae | Adenovirus | Parainfluenza 4 | Coronavirus229E | CoronavirusOC43 | Inf A H1N1 2009 | Bordetella pertussis | Metapneumovirus | Parainfluenza 2 | Neutrophils | Urea | Proteina C reativa mg/dL | Creatinine | Potassium | Sodium | Influenza B, rapid test | Influenza A, rapid test | Alanine transaminase | Aspartate transaminase | Gamma-glutamyltransferase | Total Bilirubin | Direct Bilirubin | Indirect Bilirubin | Alkaline phosphatase | Ionized calcium | Strepto A | Magnesium | pCO2 (venous blood gas analysis) | Hb saturation (venous blood gas analysis) | Base excess (venous blood gas analysis) | pO2 (venous blood gas analysis) | Fio2 (venous blood gas analysis) | Total CO2 (venous blood gas analysis) | pH (venous blood gas analysis) | HCO3 (venous blood gas analysis) | Rods # | Segmented | Promyelocytes | Metamyelocytes | Myelocytes | Myeloblasts | Urine - Esterase | Urine - Aspect | Urine - pH | Urine - Hemoglobin | Urine - Bile pigments | Urine - Ketone Bodies | Urine - Nitrite | Urine - Density | Urine - Urobilinogen | Urine - Protein | Urine - Sugar | Urine - Leukocytes | Urine - Crystals | Urine - Red blood cells | Urine - Hyaline cylinders | Urine - Granular cylinders | Urine - Yeasts | Urine - Color | Partial thromboplastin time (PTT) | Relationship (Patient/Normal) | International normalized ratio (INR) | Lactic Dehydrogenase | Prothrombin time (PT), Activity | Vitamin B12 | Creatine phosphokinase (CPK) | Ferritin | Arterial Lactic Acid | Lipase dosage | D-Dimer | Albumin | Hb saturation (arterial blood gases) | pCO2 (arterial blood gas analysis) | Base excess (arterial blood gas analysis) | pH (arterial blood gas analysis) | Total CO2 (arterial blood gas analysis) | HCO3 (arterial blood gas analysis) | pO2 (arterial blood gas analysis) | Arteiral Fio2 | Phosphor | ctO2 (arterial blood gas analysis) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4441 | b7c8bff333721c1 | 12 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | negative | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1603 | 484d8a9c71f01d2 | 1 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1206 | 1f3c363371d0462 | 10 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1586 | 938004044cac19f | 6 | positive | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2730 | 2e4ddd5e391680f | 16 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3205 | e030e21895c4929 | 9 | negative | 0 | 0 | 0 | 0.191 | 0.228 | 0.965 | -0.438 | 0.031 | 1.461 | 0.244 | 0.573 | -1.140 | 0.283 | 0.007 | 0.226 | -0.746 | -1.244 | -0.413 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.454 | -0.533 | 0.091 | -1.789 | 0.863 | negative | negative | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -0.307 | -0.832 | -0.102 | -0.316 | -0.233 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5321 | cf53da64a0eb988 | 10 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 943 | 65f8331bccab88d | 17 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5029 | 968bd25963663dc | 10 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1998 | 15e96beeae631ab | 1 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The data appears to have been collected from patient medical records, making this a secondary data source.
# Code to ascertain the number of rows and columns
df.shape
(5644, 111)
# Use info() to print a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5644 entries, 0 to 5643 Columns: 111 entries, Patient ID to ctO2 (arterial blood gas analysis) dtypes: float64(70), int64(4), object(37) memory usage: 4.8+ MB
# checking for missing values in each column
df.isna().sum()
Patient ID 0 Patient age quantile 0 SARS-Cov-2 exam result 0 Patient addmited to regular ward (1=yes, 0=no) 0 Patient addmited to semi-intensive unit (1=yes, 0=no) 0 Patient addmited to intensive care unit (1=yes, 0=no) 0 Hematocrit 5041 Hemoglobin 5041 Platelets 5042 Mean platelet volume 5045 Red blood Cells 5042 Lymphocytes 5042 Mean corpuscular hemoglobin concentration (MCHC) 5042 Leukocytes 5042 Basophils 5042 Mean corpuscular hemoglobin (MCH) 5042 Eosinophils 5042 Mean corpuscular volume (MCV) 5042 Monocytes 5043 Red blood cell distribution width (RDW) 5042 Serum Glucose 5436 Respiratory Syncytial Virus 4290 Influenza A 4290 Influenza B 4290 Parainfluenza 1 4292 CoronavirusNL63 4292 Rhinovirus/Enterovirus 4292 Mycoplasma pneumoniae 5644 Coronavirus HKU1 4292 Parainfluenza 3 4292 Chlamydophila pneumoniae 4292 Adenovirus 4292 Parainfluenza 4 4292 Coronavirus229E 4292 CoronavirusOC43 4292 Inf A H1N1 2009 4292 Bordetella pertussis 4292 Metapneumovirus 4292 Parainfluenza 2 4292 Neutrophils 5131 Urea 5247 Proteina C reativa mg/dL 5138 Creatinine 5220 Potassium 5273 Sodium 5274 Influenza B, rapid test 4824 Influenza A, rapid test 4824 Alanine transaminase 5419 Aspartate transaminase 5418 Gamma-glutamyltransferase 5491 Total Bilirubin 5462 Direct Bilirubin 5462 Indirect Bilirubin 5462 Alkaline phosphatase 5500 Ionized calcium 5594 Strepto A 5312 Magnesium 5604 pCO2 (venous blood gas analysis) 5508 Hb saturation (venous blood gas analysis) 5508 Base excess (venous blood gas analysis) 5508 pO2 (venous blood gas analysis) 5508 Fio2 (venous blood gas analysis) 5643 Total CO2 (venous blood gas analysis) 5508 pH (venous blood gas analysis) 5508 HCO3 (venous blood gas analysis) 5508 Rods # 5547 Segmented 5547 Promyelocytes 5547 Metamyelocytes 5547 Myelocytes 5547 Myeloblasts 5547 Urine - Esterase 5584 Urine - Aspect 5574 Urine - pH 5574 Urine - Hemoglobin 5574 Urine - Bile pigments 5574 Urine - Ketone Bodies 5587 Urine - Nitrite 5643 Urine - Density 5574 Urine - Urobilinogen 
5575 Urine - Protein 5584 Urine - Sugar 5644 Urine - Leukocytes 5574 Urine - Crystals 5574 Urine - Red blood cells 5574 Urine - Hyaline cylinders 5577 Urine - Granular cylinders 5575 Urine - Yeasts 5574 Urine - Color 5574 Partial thromboplastin time (PTT) 5644 Relationship (Patient/Normal) 5553 International normalized ratio (INR) 5511 Lactic Dehydrogenase 5543 Prothrombin time (PT), Activity 5644 Vitamin B12 5641 Creatine phosphokinase (CPK) 5540 Ferritin 5621 Arterial Lactic Acid 5617 Lipase dosage 5636 D-Dimer 5644 Albumin 5631 Hb saturation (arterial blood gases) 5617 pCO2 (arterial blood gas analysis) 5617 Base excess (arterial blood gas analysis) 5617 pH (arterial blood gas analysis) 5617 Total CO2 (arterial blood gas analysis) 5617 HCO3 (arterial blood gas analysis) 5617 pO2 (arterial blood gas analysis) 5617 Arteiral Fio2 5624 Phosphor 5624 ctO2 (arterial blood gas analysis) 5617 dtype: int64
The dataset contains a large amount of missing data: most laboratory columns are populated for only a few hundred of the 5644 patients.
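Before deciding how to handle the missing data, it helps to express missingness as a fraction per column. A minimal sketch on a toy frame (the column names and the 80% cutoff here are illustrative choices, not from this dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern above: one complete column, one mostly missing
toy = pd.DataFrame({
    "age_quantile": range(10),
    "hematocrit": [0.2] + [np.nan] * 9,
})

# Fraction of missing values per column, highest first
missing_frac = toy.isna().mean().sort_values(ascending=False)

# Columns missing for more than 80% of rows (illustrative cutoff)
mostly_missing = missing_frac[missing_frac > 0.8].index.tolist()

print(missing_frac.to_dict())  # {'hematocrit': 0.9, 'age_quantile': 0.0}
print(mostly_missing)          # ['hematocrit']
```

Applied to `df`, the same two lines would rank all 111 columns and flag the sparse laboratory panels for either imputation or removal.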
# checking for duplicate values
df.duplicated().sum()
0
There are no duplicated rows in this dataset.
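Note that `df.duplicated()` only flags rows that are identical across all columns; two records for the same patient with different lab values would not be caught. Checking the identifier column directly guards against this, sketched here on a toy frame (the data is invented, the column name matches this dataset):

```python
import pandas as pd

# Toy frame with a repeated patient identifier but different lab values
toy = pd.DataFrame({
    "Patient ID": ["a1", "b2", "a1"],
    "Hemoglobin": [0.1, -0.5, 0.3],
})

# Full-row check misses the repeat, the identifier check catches it
print(toy.duplicated().sum())                # 0: no fully identical rows
print(toy["Patient ID"].duplicated().sum())  # 1: 'a1' appears twice
```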
# Code to check the number of unique values in each column
df.nunique()
Patient ID 5644 Patient age quantile 20 SARS-Cov-2 exam result 2 Patient addmited to regular ward (1=yes, 0=no) 2 Patient addmited to semi-intensive unit (1=yes, 0=no) 2 Patient addmited to intensive care unit (1=yes, 0=no) 2 Hematocrit 176 Hemoglobin 84 Platelets 249 Mean platelet volume 48 Red blood Cells 211 Lymphocytes 318 Mean corpuscular hemoglobin concentration (MCHC) 57 Leukocytes 475 Basophils 17 Mean corpuscular hemoglobin (MCH) 91 Eosinophils 86 Mean corpuscular volume (MCV) 190 Monocytes 146 Red blood cell distribution width (RDW) 61 Serum Glucose 71 Respiratory Syncytial Virus 2 Influenza A 2 Influenza B 2 Parainfluenza 1 2 CoronavirusNL63 2 Rhinovirus/Enterovirus 2 Mycoplasma pneumoniae 0 Coronavirus HKU1 2 Parainfluenza 3 2 Chlamydophila pneumoniae 2 Adenovirus 2 Parainfluenza 4 2 Coronavirus229E 2 CoronavirusOC43 2 Inf A H1N1 2009 2 Bordetella pertussis 2 Metapneumovirus 2 Parainfluenza 2 1 Neutrophils 308 Urea 54 Proteina C reativa mg/dL 265 Creatinine 119 Potassium 22 Sodium 19 Influenza B, rapid test 2 Influenza A, rapid test 2 Alanine transaminase 62 Aspartate transaminase 51 Gamma-glutamyltransferase 70 Total Bilirubin 19 Direct Bilirubin 10 Indirect Bilirubin 10 Alkaline phosphatase 82 Ionized calcium 20 Strepto A 3 Magnesium 9 pCO2 (venous blood gas analysis) 97 Hb saturation (venous blood gas analysis) 120 Base excess (venous blood gas analysis) 72 pO2 (venous blood gas analysis) 121 Fio2 (venous blood gas analysis) 1 Total CO2 (venous blood gas analysis) 78 pH (venous blood gas analysis) 89 HCO3 (venous blood gas analysis) 78 Rods # 15 Segmented 55 Promyelocytes 2 Metamyelocytes 4 Myelocytes 4 Myeloblasts 1 Urine - Esterase 2 Urine - Aspect 4 Urine - pH 15 Urine - Hemoglobin 3 Urine - Bile pigments 2 Urine - Ketone Bodies 2 Urine - Nitrite 1 Urine - Density 24 Urine - Urobilinogen 2 Urine - Protein 2 Urine - Sugar 0 Urine - Leukocytes 31 Urine - Crystals 5 Urine - Red blood cells 32 Urine - Hyaline cylinders 1 Urine - Granular cylinders 1 
Urine - Yeasts 1 Urine - Color 4 Partial thromboplastin time (PTT) 0 Relationship (Patient/Normal) 35 International normalized ratio (INR) 42 Lactic Dehydrogenase 79 Prothrombin time (PT), Activity 0 Vitamin B12 3 Creatine phosphokinase (CPK) 77 Ferritin 23 Arterial Lactic Acid 13 Lipase dosage 7 D-Dimer 0 Albumin 10 Hb saturation (arterial blood gases) 23 pCO2 (arterial blood gas analysis) 25 Base excess (arterial blood gas analysis) 20 pH (arterial blood gas analysis) 24 Total CO2 (arterial blood gas analysis) 24 HCO3 (arterial blood gas analysis) 23 pO2 (arterial blood gas analysis) 27 Arteiral Fio2 9 Phosphor 16 ctO2 (arterial blood gas analysis) 19 dtype: int64
There are 5644 unique Patient IDs, so each row represents a distinct patient.
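The `nunique()` output also shows columns with zero or one unique non-null value (e.g. `Mycoplasma pneumoniae`, `Parainfluenza 2`, `Urine - Sugar`); constant columns carry no predictive signal and are candidates for dropping. A minimal sketch on a toy frame (column names shortened for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: one informative column, one constant, one entirely missing
toy = pd.DataFrame({
    "result": ["negative", "positive", "negative"],
    "parainfluenza_2": ["not_detected"] * 3,  # single unique value
    "mycoplasma": [np.nan, np.nan, np.nan],   # no non-null values at all
})

# Columns with at most one unique non-null value are uninformative
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)                  # ['parainfluenza_2', 'mycoplasma']
print(toy_reduced.columns.tolist())   # ['result']
```

The same list comprehension run on `df` would pick up the constant and empty columns seen in the output above.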
# statistical summary of the data
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Patient ID | 5644 | 5644 | 44477f75e8169d2 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Patient age quantile | 5644.000 | NaN | NaN | NaN | 9.318 | 5.778 | 0.000 | 4.000 | 9.000 | 14.000 | 19.000 |
| SARS-Cov-2 exam result | 5644 | 2 | negative | 5086 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Patient addmited to regular ward (1=yes, 0=no) | 5644.000 | NaN | NaN | NaN | 0.014 | 0.117 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Patient addmited to semi-intensive unit (1=yes, 0=no) | 5644.000 | NaN | NaN | NaN | 0.009 | 0.094 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Patient addmited to intensive care unit (1=yes, 0=no) | 5644.000 | NaN | NaN | NaN | 0.007 | 0.085 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Hematocrit | 603.000 | NaN | NaN | NaN | -0.000 | 1.001 | -4.501 | -0.519 | 0.053 | 0.717 | 2.663 |
| Hemoglobin | 603.000 | NaN | NaN | NaN | -0.000 | 1.001 | -4.346 | -0.586 | 0.040 | 0.730 | 2.672 |
| Platelets | 602.000 | NaN | NaN | NaN | -0.000 | 1.001 | -2.552 | -0.605 | -0.122 | 0.531 | 9.532 |
| Mean platelet volume | 599.000 | NaN | NaN | NaN | 0.000 | 1.001 | -2.458 | -0.662 | -0.102 | 0.684 | 3.713 |
| Red blood Cells | 602.000 | NaN | NaN | NaN | 0.000 | 1.001 | -3.971 | -0.568 | 0.014 | 0.666 | 3.646 |
| Lymphocytes | 602.000 | NaN | NaN | NaN | -0.000 | 1.001 | -1.865 | -0.731 | -0.014 | 0.598 | 3.764 |
| Mean corpuscular hemoglobin concentration (MCHC) | 602.000 | NaN | NaN | NaN | 0.000 | 1.001 | -5.432 | -0.552 | -0.055 | 0.642 | 3.331 |
| Leukocytes | 602.000 | NaN | NaN | NaN | 0.000 | 1.001 | -2.020 | -0.637 | -0.213 | 0.454 | 4.522 |
| Basophils | 602.000 | NaN | NaN | NaN | -0.000 | 1.001 | -1.140 | -0.529 | -0.224 | 0.387 | 11.078 |
| Mean corpuscular hemoglobin (MCH) | 602.000 | NaN | NaN | NaN | -0.000 | 1.001 | -5.938 | -0.501 | 0.126 | 0.596 | 4.099 |
| Eosinophils | 602.000 | NaN | NaN | NaN | 0.000 | 1.001 | -0.836 | -0.667 | -0.330 | 0.344 | 8.351 |
| Mean corpuscular volume (MCV) | 602.000 | NaN | NaN | NaN | -0.000 | 1.001 | -5.102 | -0.515 | 0.066 | 0.627 | 3.411 |
| Monocytes | 601.000 | NaN | NaN | NaN | -0.000 | 1.001 | -2.164 | -0.614 | -0.115 | 0.489 | 4.533 |
| Red blood cell distribution width (RDW) | 602.000 | NaN | NaN | NaN | 0.000 | 1.001 | -1.598 | -0.625 | -0.183 | 0.348 | 6.982 |
| Serum Glucose | 208.000 | NaN | NaN | NaN | 0.000 | 1.002 | -1.110 | -0.504 | -0.292 | 0.139 | 7.006 |
| Respiratory Syncytial Virus | 1354 | 2 | not_detected | 1302 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Influenza A | 1354 | 2 | not_detected | 1336 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Influenza B | 1354 | 2 | not_detected | 1277 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Parainfluenza 1 | 1352 | 2 | not_detected | 1349 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CoronavirusNL63 | 1352 | 2 | not_detected | 1307 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Rhinovirus/Enterovirus | 1352 | 2 | not_detected | 973 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Mycoplasma pneumoniae | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Coronavirus HKU1 | 1352 | 2 | not_detected | 1332 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Parainfluenza 3 | 1352 | 2 | not_detected | 1342 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Chlamydophila pneumoniae | 1352 | 2 | not_detected | 1343 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Adenovirus | 1352 | 2 | not_detected | 1339 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Parainfluenza 4 | 1352 | 2 | not_detected | 1333 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Coronavirus229E | 1352 | 2 | not_detected | 1343 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CoronavirusOC43 | 1352 | 2 | not_detected | 1344 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Inf A H1N1 2009 | 1352 | 2 | not_detected | 1254 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Bordetella pertussis | 1352 | 2 | not_detected | 1350 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Metapneumovirus | 1352 | 2 | not_detected | 1338 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Parainfluenza 2 | 1352 | 1 | not_detected | 1352 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Neutrophils | 513.000 | NaN | NaN | NaN | 0.000 | 1.001 | -3.340 | -0.652 | -0.054 | 0.684 | 2.536 |
| Urea | 397.000 | NaN | NaN | NaN | -0.000 | 1.001 | -1.630 | -0.588 | -0.142 | 0.454 | 11.247 |
| Proteina C reativa mg/dL | 506.000 | NaN | NaN | NaN | 0.000 | 1.001 | -0.535 | -0.514 | -0.394 | 0.032 | 8.027 |
| Creatinine | 424.000 | NaN | NaN | NaN | -0.000 | 1.001 | -2.390 | -0.632 | -0.081 | 0.513 | 5.054 |
| Potassium | 371.000 | NaN | NaN | NaN | 0.000 | 1.001 | -2.283 | -0.800 | -0.059 | 0.683 | 3.402 |
| Sodium | 370.000 | NaN | NaN | NaN | 0.000 | 1.001 | -5.247 | -0.575 | 0.144 | 0.503 | 4.097 |
| Influenza B, rapid test | 820 | 2 | negative | 771 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Influenza A, rapid test | 820 | 2 | negative | 768 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Alanine transaminase | 225.000 | NaN | NaN | NaN | 0.000 | 1.002 | -0.642 | -0.449 | -0.284 | 0.102 | 7.931 |
| Aspartate transaminase | 226.000 | NaN | NaN | NaN | -0.000 | 1.002 | -0.704 | -0.433 | -0.278 | 0.031 | 7.231 |
| Gamma-glutamyltransferase | 153.000 | NaN | NaN | NaN | -0.000 | 1.003 | -0.477 | -0.376 | -0.286 | -0.061 | 8.508 |
| Total Bilirubin | 182.000 | NaN | NaN | NaN | -0.000 | 1.003 | -1.093 | -0.787 | -0.175 | 0.131 | 5.029 |
| Direct Bilirubin | 182.000 | NaN | NaN | NaN | 0.000 | 1.003 | -1.170 | -0.586 | -0.003 | -0.003 | 6.996 |
| Indirect Bilirubin | 182.000 | NaN | NaN | NaN | 0.000 | 1.003 | -0.771 | -0.771 | -0.279 | 0.214 | 6.615 |
| Alkaline phosphatase | 144.000 | NaN | NaN | NaN | -0.000 | 1.003 | -0.959 | -0.609 | -0.358 | 0.054 | 3.883 |
| Ionized calcium | 50.000 | NaN | NaN | NaN | 0.000 | 1.010 | -2.100 | -0.729 | 0.060 | 0.558 | 3.549 |
| Strepto A | 332 | 3 | negative | 297 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Magnesium | 40.000 | NaN | NaN | NaN | -0.000 | 1.013 | -2.191 | -0.558 | -0.014 | 0.531 | 2.164 |
| pCO2 (venous blood gas analysis) | 136.000 | NaN | NaN | NaN | -0.000 | 1.004 | -2.705 | -0.547 | 0.014 | 0.619 | 5.680 |
| Hb saturation (venous blood gas analysis) | 136.000 | NaN | NaN | NaN | 0.000 | 1.004 | -2.296 | -0.803 | 0.090 | 0.817 | 1.708 |
| Base excess (venous blood gas analysis) | 136.000 | NaN | NaN | NaN | -0.000 | 1.004 | -3.669 | -0.402 | 0.080 | 0.554 | 3.357 |
| pO2 (venous blood gas analysis) | 136.000 | NaN | NaN | NaN | -0.000 | 1.004 | -1.634 | -0.694 | -0.213 | 0.483 | 3.775 |
| Fio2 (venous blood gas analysis) | 1.000 | NaN | NaN | NaN | 0.000 | NaN | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Total CO2 (venous blood gas analysis) | 136.000 | NaN | NaN | NaN | -0.000 | 1.004 | -2.598 | -0.495 | 0.104 | 0.542 | 3.021 |
| pH (venous blood gas analysis) | 136.000 | NaN | NaN | NaN | 0.000 | 1.004 | -4.773 | -0.526 | -0.091 | 0.490 | 2.790 |
| HCO3 (venous blood gas analysis) | 136.000 | NaN | NaN | NaN | -0.000 | 1.004 | -2.645 | -0.529 | 0.101 | 0.529 | 2.782 |
| Rods # | 97.000 | NaN | NaN | NaN | 0.000 | 1.005 | -0.624 | -0.624 | -0.624 | 0.326 | 3.496 |
| Segmented | 97.000 | NaN | NaN | NaN | -0.000 | 1.005 | -2.264 | -0.673 | 0.176 | 0.919 | 1.502 |
| Promyelocytes | 97.000 | NaN | NaN | NaN | 0.000 | 1.005 | -0.102 | -0.102 | -0.102 | -0.102 | 9.798 |
| Metamyelocytes | 97.000 | NaN | NaN | NaN | 0.000 | 1.005 | -0.316 | -0.316 | -0.316 | -0.316 | 6.136 |
| Myelocytes | 97.000 | NaN | NaN | NaN | 0.000 | 1.005 | -0.233 | -0.233 | -0.233 | -0.233 | 6.551 |
| Myeloblasts | 97.000 | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Urine - Esterase | 60 | 2 | absent | 59 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Aspect | 70 | 4 | clear | 61 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - pH | 70 | 15 | 5.0 | 14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Hemoglobin | 70 | 3 | absent | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Bile pigments | 70 | 2 | absent | 69 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Ketone Bodies | 57 | 2 | absent | 56 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Nitrite | 1 | 1 | not_done | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Density | 70.000 | NaN | NaN | NaN | -0.000 | 1.007 | -1.757 | -0.764 | -0.055 | 0.655 | 2.499 |
| Urine - Urobilinogen | 69 | 2 | normal | 68 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Protein | 60 | 2 | absent | 59 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Sugar | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Leukocytes | 70 | 31 | <1000 | 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Crystals | 70 | 5 | Ausentes | 65 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Red blood cells | 70.000 | NaN | NaN | NaN | 0.000 | 1.007 | -0.202 | -0.202 | -0.194 | -0.166 | 7.822 |
| Urine - Hyaline cylinders | 67 | 1 | absent | 67 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Granular cylinders | 69 | 1 | absent | 69 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Yeasts | 70 | 1 | absent | 70 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Urine - Color | 70 | 4 | yellow | 55 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Partial thromboplastin time (PTT) | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Relationship (Patient/Normal) | 91.000 | NaN | NaN | NaN | -0.000 | 1.006 | -2.351 | -0.497 | -0.089 | 0.453 | 4.706 |
| International normalized ratio (INR) | 133.000 | NaN | NaN | NaN | -0.000 | 1.004 | -1.797 | -0.665 | -0.156 | 0.297 | 7.370 |
| Lactic Dehydrogenase | 101.000 | NaN | NaN | NaN | 0.000 | 1.005 | -1.359 | -0.700 | -0.331 | 0.473 | 2.950 |
| Prothrombin time (PT), Activity | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Vitamin B12 | 3.000 | NaN | NaN | NaN | -0.000 | 1.225 | -1.401 | -0.435 | 0.531 | 0.700 | 0.870 |
| Creatine phosphokinase (CPK) | 104.000 | NaN | NaN | NaN | -0.000 | 1.005 | -0.516 | -0.377 | -0.225 | 0.035 | 7.216 |
| Ferritin | 23.000 | NaN | NaN | NaN | 0.000 | 1.022 | -0.628 | -0.560 | -0.358 | 0.120 | 3.846 |
| Arterial Lactic Acid | 27.000 | NaN | NaN | NaN | -0.000 | 1.019 | -1.091 | -0.695 | -0.298 | 0.230 | 3.004 |
| Lipase dosage | 8.000 | NaN | NaN | NaN | -0.000 | 1.069 | -1.192 | -0.547 | -0.351 | 0.182 | 1.725 |
| D-Dimer | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Albumin | 13.000 | NaN | NaN | NaN | -0.000 | 1.041 | -2.290 | -0.539 | -0.038 | 0.462 | 1.963 |
| Hb saturation (arterial blood gases) | 27.000 | NaN | NaN | NaN | -0.000 | 1.019 | -2.000 | -1.123 | 0.268 | 0.738 | 1.337 |
| pCO2 (arterial blood gas analysis) | 27.000 | NaN | NaN | NaN | 0.000 | 1.019 | -1.245 | -0.535 | -0.212 | 0.023 | 3.237 |
| Base excess (arterial blood gas analysis) | 27.000 | NaN | NaN | NaN | -0.000 | 1.019 | -3.083 | -0.331 | -0.012 | 0.666 | 1.703 |
| pH (arterial blood gas analysis) | 27.000 | NaN | NaN | NaN | 0.000 | 1.019 | -3.569 | -0.092 | 0.294 | 0.512 | 1.043 |
| Total CO2 (arterial blood gas analysis) | 27.000 | NaN | NaN | NaN | -0.000 | 1.019 | -2.926 | -0.512 | 0.077 | 0.439 | 1.940 |
| HCO3 (arterial blood gas analysis) | 27.000 | NaN | NaN | NaN | 0.000 | 1.019 | -2.986 | -0.540 | 0.056 | 0.509 | 2.029 |
| pO2 (arterial blood gas analysis) | 27.000 | NaN | NaN | NaN | -0.000 | 1.019 | -1.176 | -0.817 | -0.160 | 0.450 | 2.205 |
| Arteiral Fio2 | 20.000 | NaN | NaN | NaN | 0.000 | 1.026 | -1.533 | -0.121 | -0.012 | -0.012 | 2.842 |
| Phosphor | 20.000 | NaN | NaN | NaN | 0.000 | 1.026 | -1.481 | -0.553 | -0.138 | 0.276 | 2.862 |
| ctO2 (arterial blood gas analysis) | 27.000 | NaN | NaN | NaN | 0.000 | 1.019 | -2.900 | -0.485 | 0.183 | 0.594 | 1.827 |
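Several tests in the summary above ('Mycoplasma pneumoniae', 'Urine - Sugar', 'Partial thromboplastin time (PTT)', 'Prothrombin time (PT), Activity', 'D-Dimer') have a count of zero, i.e., no patient has a recorded value. A minimal sketch of how such all-missing columns could be dropped before analysis, shown on a small hypothetical frame (`toy` stands in for the project's `df`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df: two partially observed lab values and one
# column that is entirely missing, like 'D-Dimer' in the summary above.
toy = pd.DataFrame({
    "Leukocytes": [0.1, -0.5, np.nan],
    "D-Dimer": [np.nan, np.nan, np.nan],
    "Urea": [0.3, np.nan, 1.2],
})

# Keep only columns with at least one observed value
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())  # ['Leukocytes', 'Urea']
```

`how="all"` removes a column only when every entry is missing, so sparsely observed tests like 'Ionized calcium' would survive this step.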
# Columns to be analyzed
df.columns.tolist()
['Patient ID', 'Patient age quantile', 'SARS-Cov-2 exam result', 'Patient addmited to regular ward (1=yes, 0=no)', 'Patient addmited to semi-intensive unit (1=yes, 0=no)', 'Patient addmited to intensive care unit (1=yes, 0=no)', 'Hematocrit', 'Hemoglobin', 'Platelets', 'Mean platelet volume ', 'Red blood Cells', 'Lymphocytes', 'Mean corpuscular hemoglobin concentration\xa0(MCHC)', 'Leukocytes', 'Basophils', 'Mean corpuscular hemoglobin (MCH)', 'Eosinophils', 'Mean corpuscular volume (MCV)', 'Monocytes', 'Red blood cell distribution width (RDW)', 'Serum Glucose', 'Respiratory Syncytial Virus', 'Influenza A', 'Influenza B', 'Parainfluenza 1', 'CoronavirusNL63', 'Rhinovirus/Enterovirus', 'Mycoplasma pneumoniae', 'Coronavirus HKU1', 'Parainfluenza 3', 'Chlamydophila pneumoniae', 'Adenovirus', 'Parainfluenza 4', 'Coronavirus229E', 'CoronavirusOC43', 'Inf A H1N1 2009', 'Bordetella pertussis', 'Metapneumovirus', 'Parainfluenza 2', 'Neutrophils', 'Urea', 'Proteina C reativa mg/dL', 'Creatinine', 'Potassium', 'Sodium', 'Influenza B, rapid test', 'Influenza A, rapid test', 'Alanine transaminase', 'Aspartate transaminase', 'Gamma-glutamyltransferase\xa0', 'Total Bilirubin', 'Direct Bilirubin', 'Indirect Bilirubin', 'Alkaline phosphatase', 'Ionized calcium\xa0', 'Strepto A', 'Magnesium', 'pCO2 (venous blood gas analysis)', 'Hb saturation (venous blood gas analysis)', 'Base excess (venous blood gas analysis)', 'pO2 (venous blood gas analysis)', 'Fio2 (venous blood gas analysis)', 'Total CO2 (venous blood gas analysis)', 'pH (venous blood gas analysis)', 'HCO3 (venous blood gas analysis)', 'Rods #', 'Segmented', 'Promyelocytes', 'Metamyelocytes', 'Myelocytes', 'Myeloblasts', 'Urine - Esterase', 'Urine - Aspect', 'Urine - pH', 'Urine - Hemoglobin', 'Urine - Bile pigments', 'Urine - Ketone Bodies', 'Urine - Nitrite', 'Urine - Density', 'Urine - Urobilinogen', 'Urine - Protein', 'Urine - Sugar', 'Urine - Leukocytes', 'Urine - Crystals', 'Urine - Red blood cells', 'Urine - Hyaline 
cylinders', 'Urine - Granular cylinders', 'Urine - Yeasts', 'Urine - Color', 'Partial thromboplastin time\xa0(PTT)\xa0', 'Relationship (Patient/Normal)', 'International normalized ratio (INR)', 'Lactic Dehydrogenase', 'Prothrombin time (PT), Activity', 'Vitamin B12', 'Creatine phosphokinase\xa0(CPK)\xa0', 'Ferritin', 'Arterial Lactic Acid', 'Lipase dosage', 'D-Dimer', 'Albumin', 'Hb saturation (arterial blood gases)', 'pCO2 (arterial blood gas analysis)', 'Base excess (arterial blood gas analysis)', 'pH (arterial blood gas analysis)', 'Total CO2 (arterial blood gas analysis)', 'HCO3 (arterial blood gas analysis)', 'pO2 (arterial blood gas analysis)', 'Arteiral Fio2', 'Phosphor', 'ctO2 (arterial blood gas analysis)']
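Several raw column names in the listing above carry non-breaking spaces (`\xa0`) and trailing blanks (e.g. `'Gamma-glutamyltransferase\xa0'`), which make column lookups error-prone. A small sketch of one way these names could be normalized (shown on a hand-picked sample, not applied to the project's `df`):

```python
import pandas as pd

# Sample of raw names from the column listing above
cols = pd.Index([
    'Mean corpuscular hemoglobin concentration\xa0(MCHC)',
    'Gamma-glutamyltransferase\xa0',
    'Ionized calcium\xa0',
])

# Replace non-breaking spaces with regular spaces and strip the ends
clean = cols.str.replace('\xa0', ' ', regex=False).str.strip()
print(clean.tolist())
```

Applying the same transformation to `df.columns` would let later cells reference columns without embedding `\xa0` in the string literals.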
# Create a subset of the df dataframe comprising continuous data variables
df1 = df[['Patient age quantile', 'Hematocrit', 'Hemoglobin', 'Platelets', 'Red blood Cells',
'Lymphocytes',
'Mean corpuscular hemoglobin concentration\xa0(MCHC)',
'Leukocytes',
'Basophils',
'Mean corpuscular hemoglobin (MCH)',
'Eosinophils',
'Mean corpuscular volume (MCV)',
'Monocytes',
'Red blood cell distribution width (RDW)',
'Serum Glucose',
'Neutrophils',
'Urea',
'Proteina C reativa mg/dL',
'Creatinine',
'Potassium',
'Sodium',
'Aspartate transaminase',
'Gamma-glutamyltransferase\xa0',
'Total Bilirubin',
'Direct Bilirubin',
'Indirect Bilirubin',
'Alkaline phosphatase',
'Ionized calcium\xa0',
'Magnesium',
'pCO2 (venous blood gas analysis)',
'Hb saturation (venous blood gas analysis)',
'Base excess (venous blood gas analysis)',
'pO2 (venous blood gas analysis)',
'Fio2 (venous blood gas analysis)',
'Total CO2 (venous blood gas analysis)',
'pH (venous blood gas analysis)',
'HCO3 (venous blood gas analysis)',
'Rods #',
'Segmented',
'Promyelocytes',
'Metamyelocytes',
'Myelocytes',
'Myeloblasts',
'Relationship (Patient/Normal)',
'International normalized ratio (INR)',
'Lactic Dehydrogenase',
'Vitamin B12',
'Creatine phosphokinase\xa0(CPK)\xa0',
'Ferritin','Urine - Red blood cells',
'Arterial Lactic Acid',
'Lipase dosage',
'Albumin', 'Hb saturation (arterial blood gases)', 'pCO2 (arterial blood gas analysis)', 'Base excess (arterial blood gas analysis)', 'pH (arterial blood gas analysis)', 'Total CO2 (arterial blood gas analysis)', 'HCO3 (arterial blood gas analysis)', 'pO2 (arterial blood gas analysis)','Arteiral Fio2','Phosphor', 'ctO2 (arterial blood gas analysis)']]
# Create a subset of the df dataframe comprising categorical data variables
df2 = df[['SARS-Cov-2 exam result',
'Patient addmited to regular ward (1=yes, 0=no)',
'Patient addmited to semi-intensive unit (1=yes, 0=no)',
'Patient addmited to intensive care unit (1=yes, 0=no)', 'Respiratory Syncytial Virus',
'Influenza A',
'Influenza B',
'Parainfluenza 1',
'CoronavirusNL63',
'Rhinovirus/Enterovirus',
'Coronavirus HKU1',
'Parainfluenza 3',
'Chlamydophila pneumoniae',
'Adenovirus',
'Parainfluenza 4',
'Coronavirus229E',
'CoronavirusOC43',
'Inf A H1N1 2009',
'Bordetella pertussis',
'Metapneumovirus',
'Parainfluenza 2','Influenza B, rapid test',
'Influenza A, rapid test', 'Strepto A', 'Urine - Esterase',
'Urine - Aspect',
'Urine - Hemoglobin',
'Urine - Bile pigments',
'Urine - Ketone Bodies',
'Urine - Nitrite',
'Urine - Urobilinogen',
'Urine - Protein',
'Urine - Crystals',
'Urine - Hyaline cylinders',
'Urine - Granular cylinders',
'Urine - Yeasts',
'Urine - Color']]
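The continuous/categorical split above is written out by hand. As a cross-check, the same split can be derived from column dtypes with `select_dtypes`; a sketch on a hypothetical two-column frame (`toy` is illustrative, not the project's `df`):

```python
import pandas as pd

# Toy frame mixing a numeric lab value with a categorical test result,
# mirroring the manual split into df1 (continuous) and df2 (categorical)
toy = pd.DataFrame({
    "Hematocrit": [0.2, -1.1, 0.5],
    "SARS-Cov-2 exam result": ["negative", "positive", "negative"],
})

num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(include="object").columns.tolist()
print(num_cols, cat_cols)  # ['Hematocrit'] ['SARS-Cov-2 exam result']
```

On the real dataset this would also sweep in the 0/1 admission flags as numeric, so the dtype-based split is a starting point rather than a replacement for the manual lists.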
# Function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot, with a star indicating the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add the mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add the median to the histogram
# Function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with count or percentage at the top of each bar

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default False)
    n: displays the top n category levels (default None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage
    plt.show()  # show the plot
# Visualize each continuous variable with the histogram_boxplot function
for feature in df1.columns:
    histogram_boxplot(df1, feature, figsize=(12, 7), kde=False, bins=None)
# Visualize each categorical variable with the labeled_barplot function
for feature in df2.columns:
    labeled_barplot(df2, feature, perc=True)
sns.boxplot(data=df, x='SARS-Cov-2 exam result', y='Patient age quantile')
plt.show()
sns.boxplot(data=df[df['SARS-Cov-2 exam result'] == 'positive'], x='Patient addmited to intensive care unit (1=yes, 0=no)', y='Patient age quantile')
plt.show()
Among Covid-19-positive cases, patients in the higher age quantiles are more likely to be admitted to the ICU.
sns.violinplot(data=df, x='Patient addmited to intensive care unit (1=yes, 0=no)', y='Patient age quantile', hue="SARS-Cov-2 exam result")
plt.show()
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x="SARS-Cov-2 exam result", y='pCO2 (arterial blood gas analysis)')
plt.show()
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x="SARS-Cov-2 exam result", y='Urea')
plt.show()
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x="SARS-Cov-2 exam result", y='International normalized ratio (INR)')
plt.show()
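Alongside the boxplots, the visual difference between test-result groups can be quantified with a simple `groupby` median. A sketch on a hypothetical four-row frame (`toy` stands in for the project's `df`; the values are made up):

```python
import pandas as pd

# Toy frame standing in for df: exam result plus one standardized lab value
toy = pd.DataFrame({
    "SARS-Cov-2 exam result": ["negative", "positive", "negative", "positive"],
    "Urea": [-0.5, 0.8, -0.1, 1.2],
})

# Median of the lab value within each exam-result group
medians = toy.groupby("SARS-Cov-2 exam result")["Urea"].median()
print(medians)
```

A per-group median like this makes the shift suggested by a boxplot explicit and is robust to the heavy outliers visible in the distributions above.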
corr = df.corr()  # pairwise correlation between the numeric variables
corr
| Patient age quantile | Patient addmited to regular ward (1=yes, 0=no) | Patient addmited to semi-intensive unit (1=yes, 0=no) | Patient addmited to intensive care unit (1=yes, 0=no) | Hematocrit | Hemoglobin | Platelets | Mean platelet volume | Red blood Cells | Lymphocytes | Mean corpuscular hemoglobin concentration (MCHC) | Leukocytes | Basophils | Mean corpuscular hemoglobin (MCH) | Eosinophils | Mean corpuscular volume (MCV) | Monocytes | Red blood cell distribution width (RDW) | Serum Glucose | Mycoplasma pneumoniae | Neutrophils | Urea | Proteina C reativa mg/dL | Creatinine | Potassium | Sodium | Alanine transaminase | Aspartate transaminase | Gamma-glutamyltransferase | Total Bilirubin | Direct Bilirubin | Indirect Bilirubin | Alkaline phosphatase | Ionized calcium | Magnesium | pCO2 (venous blood gas analysis) | Hb saturation (venous blood gas analysis) | Base excess (venous blood gas analysis) | pO2 (venous blood gas analysis) | Fio2 (venous blood gas analysis) | Total CO2 (venous blood gas analysis) | pH (venous blood gas analysis) | HCO3 (venous blood gas analysis) | Rods # | Segmented | Promyelocytes | Metamyelocytes | Myelocytes | Myeloblasts | Urine - Density | Urine - Sugar | Urine - Red blood cells | Partial thromboplastin time (PTT) | Relationship (Patient/Normal) | International normalized ratio (INR) | Lactic Dehydrogenase | Prothrombin time (PT), Activity | Vitamin B12 | Creatine phosphokinase (CPK) | Ferritin | Arterial Lactic Acid | Lipase dosage | D-Dimer | Albumin | Hb saturation (arterial blood gases) | pCO2 (arterial blood gas analysis) | Base excess (arterial blood gas analysis) | pH (arterial blood gas analysis) | Total CO2 (arterial blood gas analysis) | HCO3 (arterial blood gas analysis) | pO2 (arterial blood gas analysis) | Arteiral Fio2 | Phosphor | ctO2 (arterial blood gas analysis) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Patient age quantile | 1.000 | 0.046 | 0.016 | -0.036 | 0.097 | 0.060 | -0.159 | 0.119 | -0.038 | -0.126 | -0.125 | -0.166 | 0.108 | 0.197 | 0.022 | 0.282 | 0.051 | 0.166 | 0.216 | NaN | 0.087 | 0.338 | 0.088 | 0.373 | 0.002 | -0.005 | 0.129 | 0.039 | 0.224 | 0.146 | 0.268 | 0.008 | -0.481 | -0.310 | -0.128 | 0.208 | -0.059 | 0.555 | -0.071 | NaN | 0.503 | 0.256 | 0.511 | 0.047 | 0.284 | 0.130 | 0.179 | 0.089 | NaN | -0.118 | NaN | 0.160 | NaN | -0.123 | 0.014 | -0.150 | NaN | 0.981 | -0.101 | 0.396 | 0.097 | -0.357 | NaN | -0.137 | -0.224 | -0.469 | 0.570 | 0.571 | 0.086 | 0.166 | -0.098 | -0.335 | -0.512 | -0.061 |
| Patient addmited to regular ward (1=yes, 0=no) | 0.046 | 1.000 | -0.011 | -0.010 | -0.087 | -0.092 | -0.183 | -0.013 | -0.053 | -0.095 | -0.035 | -0.103 | 0.032 | -0.051 | -0.086 | -0.039 | -0.000 | 0.102 | 0.059 | NaN | 0.127 | -0.012 | 0.133 | 0.085 | -0.027 | -0.087 | -0.004 | -0.007 | 0.032 | -0.030 | -0.010 | -0.040 | -0.055 | -0.187 | -0.006 | -0.136 | -0.088 | 0.042 | -0.071 | NaN | -0.041 | 0.181 | -0.034 | 0.085 | 0.063 | -0.031 | -0.034 | -0.070 | NaN | -0.202 | NaN | -0.049 | NaN | 0.024 | -0.100 | 0.118 | NaN | NaN | -0.080 | 0.410 | -0.076 | 0.316 | NaN | NaN | 0.198 | -0.227 | 0.033 | 0.204 | -0.160 | -0.133 | 0.106 | -0.174 | NaN | 0.273 |
| Patient addmited to semi-intensive unit (1=yes, 0=no) | 0.016 | -0.011 | 1.000 | -0.008 | -0.182 | -0.177 | 0.007 | -0.023 | -0.138 | -0.111 | -0.023 | 0.138 | -0.133 | -0.054 | -0.090 | -0.051 | -0.038 | 0.092 | 0.198 | NaN | 0.087 | 0.082 | 0.241 | -0.034 | -0.014 | -0.127 | 0.022 | 0.085 | 0.158 | 0.027 | 0.059 | -0.007 | 0.296 | 0.034 | -0.008 | -0.160 | 0.151 | -0.026 | 0.183 | NaN | -0.104 | 0.136 | -0.097 | 0.185 | 0.083 | 0.239 | 0.279 | 0.416 | NaN | -0.141 | NaN | 0.394 | NaN | -0.183 | 0.083 | 0.191 | NaN | 0.615 | -0.001 | 0.084 | 0.024 | NaN | NaN | -0.661 | -0.559 | 0.113 | -0.226 | -0.179 | -0.113 | -0.137 | -0.339 | -0.091 | 0.185 | -0.049 |
| Patient addmited to intensive care unit (1=yes, 0=no) | -0.036 | -0.010 | -0.008 | 1.000 | -0.184 | -0.179 | 0.126 | -0.074 | -0.121 | -0.110 | -0.036 | 0.272 | -0.121 | -0.090 | -0.089 | -0.078 | -0.104 | 0.194 | 0.124 | NaN | 0.103 | 0.199 | 0.305 | -0.037 | 0.066 | 0.016 | 0.132 | 0.159 | 0.241 | 0.142 | 0.248 | 0.019 | 0.182 | -0.273 | 0.156 | 0.069 | 0.190 | -0.119 | 0.247 | NaN | -0.076 | -0.148 | -0.081 | 0.316 | 0.201 | -0.049 | 0.063 | -0.051 | NaN | -0.101 | NaN | -0.045 | NaN | -0.023 | 0.186 | 0.361 | NaN | NaN | -0.020 | 0.820 | -0.205 | NaN | NaN | NaN | 0.352 | 0.298 | 0.204 | -0.180 | 0.425 | 0.411 | 0.156 | 0.348 | 0.130 | -0.383 |
| Hematocrit | 0.097 | -0.087 | -0.182 | -0.184 | 1.000 | 0.968 | -0.082 | 0.084 | 0.873 | 0.002 | 0.131 | -0.090 | 0.129 | 0.075 | 0.030 | 0.025 | 0.082 | -0.265 | -0.133 | NaN | -0.017 | -0.071 | -0.238 | 0.308 | 0.078 | 0.099 | -0.064 | -0.150 | -0.279 | 0.014 | -0.128 | 0.131 | -0.282 | 0.149 | -0.210 | 0.083 | -0.102 | 0.140 | -0.178 | NaN | 0.177 | 0.045 | 0.176 | -0.219 | 0.072 | -0.252 | -0.329 | -0.428 | NaN | 0.192 | NaN | -0.291 | NaN | -0.035 | -0.050 | -0.303 | NaN | -0.476 | 0.073 | -0.538 | 0.112 | 0.170 | NaN | 0.537 | -0.046 | -0.180 | -0.196 | 0.064 | -0.344 | -0.340 | 0.124 | 0.066 | 0.172 | 0.878 |
| Hemoglobin | 0.060 | -0.092 | -0.177 | -0.179 | 0.968 | 1.000 | -0.120 | 0.079 | 0.841 | -0.004 | 0.372 | -0.102 | 0.116 | 0.185 | 0.019 | 0.028 | 0.095 | -0.342 | -0.152 | NaN | -0.021 | -0.084 | -0.230 | 0.305 | 0.050 | 0.063 | -0.042 | -0.127 | -0.258 | 0.058 | -0.101 | 0.178 | -0.274 | 0.182 | -0.174 | 0.042 | -0.118 | 0.141 | -0.183 | NaN | 0.161 | 0.079 | 0.162 | -0.239 | 0.073 | -0.224 | -0.311 | -0.396 | NaN | 0.179 | NaN | -0.280 | NaN | -0.016 | -0.012 | -0.290 | NaN | -0.810 | 0.079 | -0.537 | 0.034 | 0.184 | NaN | 0.556 | -0.035 | -0.179 | -0.273 | 0.036 | -0.419 | -0.421 | 0.081 | -0.003 | 0.260 | 0.884 |
| Platelets | -0.159 | -0.183 | 0.007 | 0.126 | -0.082 | -0.120 | 1.000 | -0.356 | -0.055 | 0.091 | -0.159 | 0.443 | -0.026 | -0.101 | 0.169 | -0.034 | -0.201 | -0.008 | -0.011 | NaN | -0.058 | -0.013 | 0.004 | -0.183 | 0.204 | 0.038 | -0.058 | -0.129 | -0.061 | -0.058 | -0.089 | -0.018 | 0.257 | 0.129 | 0.029 | 0.031 | 0.053 | -0.176 | 0.068 | NaN | -0.095 | -0.187 | -0.103 | -0.234 | 0.041 | 0.142 | -0.047 | -0.095 | NaN | 0.061 | NaN | -0.238 | NaN | -0.161 | 0.104 | 0.083 | NaN | -0.450 | -0.088 | -0.661 | -0.031 | -0.477 | NaN | 0.295 | 0.083 | 0.539 | -0.296 | -0.525 | 0.200 | 0.134 | -0.138 | 0.472 | 0.125 | -0.483 |
| Mean platelet volume | 0.119 | -0.013 | -0.023 | -0.074 | 0.084 | 0.079 | -0.356 | 1.000 | 0.043 | 0.079 | -0.004 | -0.155 | 0.129 | 0.069 | -0.047 | 0.078 | 0.038 | 0.045 | 0.063 | NaN | -0.081 | 0.093 | -0.062 | 0.122 | -0.004 | 0.108 | -0.015 | 0.050 | 0.081 | 0.039 | 0.133 | -0.050 | -0.211 | 0.110 | -0.279 | -0.010 | -0.079 | 0.183 | -0.094 | NaN | 0.147 | 0.145 | 0.153 | 0.264 | 0.003 | -0.165 | 0.047 | 0.078 | NaN | -0.003 | NaN | 0.084 | NaN | -0.010 | 0.103 | -0.205 | NaN | -0.999 | 0.210 | 0.024 | 0.302 | 0.084 | NaN | 0.539 | -0.351 | 0.090 | 0.162 | -0.010 | 0.262 | 0.267 | -0.226 | 0.080 | -0.222 | 0.018 |
| Red blood Cells | -0.038 | -0.053 | -0.138 | -0.121 | 0.873 | 0.841 | -0.055 | 0.043 | 1.000 | -0.010 | 0.090 | -0.036 | 0.079 | -0.367 | -0.004 | -0.459 | 0.045 | -0.138 | -0.037 | NaN | 0.013 | -0.121 | -0.165 | 0.206 | 0.042 | 0.060 | -0.022 | -0.069 | -0.301 | 0.015 | -0.138 | 0.140 | -0.013 | 0.042 | -0.130 | -0.016 | -0.028 | -0.029 | -0.077 | NaN | 0.007 | -0.009 | 0.006 | -0.245 | 0.013 | -0.236 | -0.367 | -0.445 | NaN | 0.219 | NaN | -0.275 | NaN | -0.021 | -0.087 | -0.066 | NaN | 0.323 | 0.103 | -0.462 | -0.013 | 0.612 | NaN | 0.441 | 0.029 | -0.351 | 0.040 | 0.260 | -0.303 | -0.268 | 0.200 | -0.258 | 0.190 | 0.848 |
| Lymphocytes | -0.126 | -0.095 | -0.111 | -0.110 | 0.002 | -0.004 | 0.091 | 0.079 | -0.010 | 1.000 | -0.028 | -0.331 | 0.235 | 0.015 | 0.200 | 0.027 | 0.065 | -0.080 | -0.182 | NaN | -0.935 | -0.108 | -0.356 | -0.175 | 0.113 | 0.209 | -0.105 | -0.122 | -0.135 | -0.206 | -0.242 | -0.127 | 0.067 | 0.482 | -0.274 | 0.074 | -0.106 | -0.105 | -0.144 | NaN | -0.006 | -0.173 | -0.016 | -0.243 | -0.933 | -0.088 | -0.085 | -0.039 | NaN | 0.196 | NaN | -0.024 | NaN | 0.167 | -0.181 | -0.048 | NaN | -0.314 | -0.125 | -0.529 | -0.177 | -0.421 | NaN | 0.266 | 0.101 | 0.500 | -0.403 | -0.545 | 0.083 | 0.013 | 0.058 | 0.227 | 0.109 | -0.136 |
| Mean corpuscular hemoglobin concentration (MCHC) | -0.125 | -0.035 | -0.023 | -0.036 | 0.131 | 0.372 | -0.159 | -0.004 | 0.090 | -0.028 | 1.000 | -0.066 | -0.026 | 0.474 | -0.042 | 0.035 | 0.070 | -0.394 | -0.117 | NaN | -0.022 | -0.083 | -0.025 | 0.055 | -0.104 | -0.145 | 0.081 | 0.072 | 0.032 | 0.175 | 0.091 | 0.205 | -0.023 | 0.165 | 0.085 | -0.169 | -0.074 | 0.010 | -0.033 | NaN | -0.053 | 0.147 | -0.045 | -0.162 | 0.048 | 0.072 | -0.021 | -0.017 | NaN | -0.042 | NaN | -0.063 | NaN | 0.071 | 0.149 | -0.009 | NaN | -0.928 | 0.050 | -0.092 | -0.179 | 0.032 | NaN | 0.068 | 0.009 | -0.090 | -0.344 | -0.048 | -0.411 | -0.427 | -0.067 | -0.166 | 0.478 | 0.384 |
| Leukocytes | -0.166 | -0.103 | 0.138 | 0.272 | -0.090 | -0.102 | 0.443 | -0.155 | -0.036 | -0.331 | -0.066 | 1.000 | -0.304 | -0.124 | -0.092 | -0.103 | -0.295 | 0.128 | 0.185 | NaN | 0.402 | 0.115 | 0.361 | -0.054 | 0.017 | -0.050 | 0.022 | 0.029 | 0.071 | 0.158 | 0.197 | 0.087 | 0.271 | -0.221 | 0.123 | -0.182 | 0.162 | -0.270 | 0.187 | NaN | -0.285 | -0.026 | -0.287 | 0.163 | 0.354 | 0.088 | 0.081 | -0.028 | NaN | -0.038 | NaN | -0.167 | NaN | -0.098 | 0.094 | 0.278 | NaN | 0.839 | 0.055 | 0.427 | 0.184 | -0.348 | NaN | -0.234 | 0.024 | 0.490 | -0.493 | -0.563 | 0.043 | -0.032 | -0.262 | 0.828 | 0.322 | -0.201 |
| Basophils | 0.108 | 0.032 | -0.133 | -0.121 | 0.129 | 0.116 | -0.026 | 0.129 | 0.079 | 0.235 | -0.026 | -0.304 | 1.000 | 0.065 | 0.335 | 0.085 | 0.099 | 0.038 | -0.076 | NaN | -0.373 | -0.020 | -0.224 | 0.082 | 0.170 | 0.116 | -0.005 | -0.052 | -0.001 | 0.039 | 0.016 | 0.049 | -0.131 | 0.023 | -0.354 | 0.293 | -0.066 | 0.276 | -0.123 | NaN | 0.352 | -0.035 | 0.345 | 0.255 | -0.059 | -0.022 | 0.080 | -0.050 | NaN | 0.006 | NaN | -0.183 | NaN | -0.023 | -0.064 | -0.134 | NaN | -0.323 | -0.042 | -0.051 | -0.293 | -0.693 | NaN | 0.277 | 0.059 | -0.315 | 0.369 | 0.431 | 0.054 | 0.101 | 0.190 | -0.322 | -0.214 | -0.053 |
| Mean corpuscular hemoglobin (MCH) | 0.197 | -0.051 | -0.054 | -0.090 | 0.075 | 0.185 | -0.101 | 0.069 | -0.367 | 0.015 | 0.474 | -0.124 | 0.065 | 1.000 | 0.030 | 0.895 | 0.093 | -0.300 | -0.188 | NaN | -0.064 | 0.101 | -0.105 | 0.192 | 0.012 | -0.008 | -0.042 | -0.099 | 0.136 | 0.084 | 0.118 | 0.035 | -0.439 | 0.307 | -0.052 | 0.108 | -0.138 | 0.309 | -0.146 | NaN | 0.278 | 0.156 | 0.282 | -0.013 | 0.122 | 0.016 | 0.089 | 0.075 | NaN | -0.080 | NaN | -0.040 | NaN | 0.000 | 0.156 | -0.374 | NaN | -0.840 | -0.037 | -0.114 | 0.092 | -0.253 | NaN | 0.179 | -0.139 | 0.321 | -0.607 | -0.431 | -0.239 | -0.309 | -0.210 | 0.422 | 0.277 | 0.130 |
| Eosinophils | 0.022 | -0.086 | -0.090 | -0.089 | 0.030 | 0.019 | 0.169 | -0.047 | -0.004 | 0.200 | -0.042 | -0.092 | 0.335 | 0.030 | 1.000 | 0.054 | 0.009 | -0.008 | -0.025 | NaN | -0.383 | 0.139 | -0.180 | 0.015 | 0.095 | 0.223 | 0.030 | -0.068 | -0.013 | -0.065 | -0.092 | -0.027 | -0.167 | 0.305 | 0.184 | 0.093 | 0.040 | 0.216 | 0.004 | NaN | 0.232 | 0.069 | 0.233 | -0.209 | -0.146 | -0.071 | 0.075 | -0.063 | NaN | 0.097 | NaN | -0.085 | NaN | 0.017 | 0.142 | -0.334 | NaN | 0.183 | -0.032 | -0.289 | -0.149 | -0.490 | NaN | 0.149 | 0.296 | -0.022 | 0.357 | 0.131 | 0.334 | 0.357 | 0.134 | -0.118 | -0.308 | -0.433 |
| Mean corpuscular volume (MCV) | 0.282 | -0.039 | -0.051 | -0.078 | 0.025 | 0.028 | -0.034 | 0.078 | -0.459 | 0.027 | 0.035 | -0.103 | 0.085 | 0.895 | 0.054 | 1.000 | 0.066 | -0.152 | -0.163 | NaN | -0.059 | 0.152 | -0.103 | 0.184 | 0.071 | 0.059 | -0.081 | -0.140 | 0.138 | 0.014 | 0.092 | -0.054 | -0.482 | 0.256 | -0.112 | 0.212 | -0.115 | 0.334 | -0.146 | NaN | 0.334 | 0.091 | 0.334 | 0.067 | 0.120 | -0.010 | 0.101 | 0.079 | NaN | -0.095 | NaN | -0.019 | NaN | -0.026 | 0.104 | -0.398 | NaN | -0.807 | -0.066 | -0.084 | 0.225 | -0.323 | NaN | 0.131 | -0.177 | 0.466 | -0.528 | -0.510 | -0.020 | -0.098 | -0.216 | 0.640 | 0.068 | -0.082 |
| Monocytes | 0.051 | -0.000 | -0.038 | -0.104 | 0.082 | 0.095 | -0.201 | 0.038 | 0.045 | 0.065 | 0.070 | -0.295 | 0.099 | 0.093 | 0.009 | 0.066 | 1.000 | -0.016 | -0.211 | NaN | -0.299 | -0.060 | -0.050 | 0.114 | -0.047 | -0.021 | -0.034 | -0.082 | -0.078 | 0.033 | 0.014 | 0.042 | -0.196 | -0.028 | -0.356 | 0.023 | -0.071 | 0.097 | -0.100 | NaN | 0.071 | 0.040 | 0.074 | 0.015 | -0.349 | -0.037 | 0.045 | 0.048 | NaN | -0.028 | NaN | -0.102 | NaN | 0.101 | 0.060 | 0.009 | NaN | -0.922 | -0.049 | 0.262 | -0.454 | -0.652 | NaN | -0.106 | -0.025 | 0.385 | -0.241 | -0.409 | 0.126 | 0.076 | -0.009 | -0.123 | 0.043 | 0.078 |
[Correlation-matrix output truncated: the remaining rows, from Red blood cell distribution width (RDW) through ctO2 (arterial blood gas analysis), list pairwise correlations among the laboratory features. NaN entries correspond to test pairs with no overlapping measurements.]
plt.figure(figsize=(15, 7))
sns.heatmap(corr, annot=True, cmap="Spectral")
plt.show()
This dataset contains a wealth of information on Covid-19 testing and its potential predictive factors.
However, the raw data requires cleaning — normalizing column names, dropping sparsely populated columns, and imputing missing values — before it can support meaningful analysis and predictive modeling.
# Normalize column names: strip whitespace and replace punctuation.
# regex=False treats each pattern as a literal string, so characters
# like '(' are not misinterpreted as regex metacharacters.
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace('(', '', regex=False)   # Remove opening brackets
df.columns = df.columns.str.replace(')', '', regex=False)   # Remove closing brackets
df.columns = df.columns.str.replace('-', '_', regex=False)  # Replace hyphens with underscores
df.columns = df.columns.str.replace(',', '_', regex=False)  # Replace commas with underscores
df.columns = df.columns.str.replace('/', '_', regex=False)  # Replace forward slashes with underscores
df.columns = df.columns.str.replace(' ', '_', regex=False)  # Replace spaces with underscores
df
| Patient_ID | Patient_age_quantile | SARS_Cov_2_exam_result | Patient_addmited_to_regular_ward_1=yes__0=no | Patient_addmited_to_semi_intensive_unit_1=yes__0=no | Patient_addmited_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Platelets | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Basophils | Mean_corpuscular_hemoglobin_MCH | Eosinophils | Mean_corpuscular_volume_MCV | Monocytes | Red_blood_cell_distribution_width_RDW | Serum_Glucose | Respiratory_Syncytial_Virus | Influenza_A | Influenza_B | Parainfluenza_1 | CoronavirusNL63 | Rhinovirus_Enterovirus | Mycoplasma_pneumoniae | Coronavirus_HKU1 | Parainfluenza_3 | Chlamydophila_pneumoniae | Adenovirus | Parainfluenza_4 | Coronavirus229E | CoronavirusOC43 | Inf_A_H1N1_2009 | Bordetella_pertussis | Metapneumovirus | Parainfluenza_2 | Neutrophils | Urea | Proteina_C_reativa_mg_dL | Creatinine | Potassium | Sodium | Influenza_B__rapid_test | Influenza_A__rapid_test | Alanine_transaminase | Aspartate_transaminase | Gamma_glutamyltransferase | Total_Bilirubin | Direct_Bilirubin | Indirect_Bilirubin | Alkaline_phosphatase | Ionized_calcium | Strepto_A | Magnesium | pCO2_venous_blood_gas_analysis | Hb_saturation_venous_blood_gas_analysis | Base_excess_venous_blood_gas_analysis | pO2_venous_blood_gas_analysis | Fio2_venous_blood_gas_analysis | Total_CO2_venous_blood_gas_analysis | pH_venous_blood_gas_analysis | HCO3_venous_blood_gas_analysis | Rods_# | Segmented | Promyelocytes | Metamyelocytes | Myelocytes | Myeloblasts | Urine___Esterase | Urine___Aspect | Urine___pH | Urine___Hemoglobin | Urine___Bile_pigments | Urine___Ketone_Bodies | Urine___Nitrite | Urine___Density | Urine___Urobilinogen | Urine___Protein | Urine___Sugar | Urine___Leukocytes | Urine___Crystals | Urine___Red_blood_cells | Urine___Hyaline_cylinders | Urine___Granular_cylinders | Urine___Yeasts | Urine___Color | Partial_thromboplastin_time PTT | 
Relationship_Patient_Normal | International_normalized_ratio_INR | Lactic_Dehydrogenase | Prothrombin_time_PT__Activity | Vitamin_B12 | Creatine_phosphokinase CPK | Ferritin | Arterial_Lactic_Acid | Lipase_dosage | D_Dimer | Albumin | Hb_saturation_arterial_blood_gases | pCO2_arterial_blood_gas_analysis | Base_excess_arterial_blood_gas_analysis | pH_arterial_blood_gas_analysis | Total_CO2_arterial_blood_gas_analysis | HCO3_arterial_blood_gas_analysis | pO2_arterial_blood_gas_analysis | Arteiral_Fio2 | Phosphor | ctO2_arterial_blood_gas_analysis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44477f75e8169d2 | 13 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 126e9dd13932f68 | 17 | negative | 0 | 0 | 0 | 0.237 | -0.022 | -0.517 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.224 | -0.292 | 1.482 | 0.166 | 0.358 | -0.625 | -0.141 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | -0.619 | 1.198 | -0.148 | 2.090 | -0.306 | 0.863 | negative | negative | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | a46b4402a0e5696 | 8 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | f7d619a94f97c45 | 5 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | d9e41465789c2b5 | 15 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | detected | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5639 | ae66feb9e4dc3a0 | 3 | positive | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5640 | 517c2834024f3ea | 17 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5641 | 5c57d6037fe266d | 4 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5642 | c20c44766f28291 | 10 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | clear | 5 | absent | absent | absent | NaN | -0.339 | normal | absent | NaN | 29000 | Ausentes | -0.177 | absent | absent | absent | yellow | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5643 | 2697fdccbfeb7f7 | 19 | positive | 0 | 0 | 0 | 0.694 | 0.542 | -0.907 | -0.326 | 0.578 | -0.296 | -0.353 | -1.288 | -1.140 | -0.135 | -0.836 | 0.026 | 0.568 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.381 | 0.454 | -0.504 | -0.736 | -0.553 | -0.934 | NaN | NaN | -0.284 | 0.109 | -0.420 | -0.481 | -0.586 | -0.279 | -0.243 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.420 | NaN | NaN | -0.343 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5644 rows × 111 columns
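The column-name cleanup above can be sketched on a small hypothetical frame (the two column names below are taken from this dataset; the frame itself is a toy). Passing `regex=False` matters because in current pandas `str.replace` treats the pattern as a regex by default, and a bare `'('` would raise an error:

```python
import pandas as pd

# Toy frame with two column names copied from the dataset
toy = pd.DataFrame(columns=["Hb saturation (arterial blood gases)",
                            "SARS-Cov-2 exam result"])

# Same cleanup steps as above: strip, drop brackets, punctuation -> underscores
toy.columns = (toy.columns.str.strip()
               .str.replace("(", "", regex=False)
               .str.replace(")", "", regex=False)
               .str.replace("-", "_", regex=False)
               .str.replace(",", "_", regex=False)
               .str.replace("/", "_", regex=False)
               .str.replace(" ", "_", regex=False))
print(list(toy.columns))
# -> ['Hb_saturation_arterial_blood_gases', 'SARS_Cov_2_exam_result']
```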
#Editing column names: Correcting grammatical errors
# Changing "addmited" to "admitted"
df.rename(columns={'Patient_addmited_to_regular_ward_1=yes__0=no': 'Patient_admitted_to_regular_ward_1=yes__0=no', 'Patient_addmited_to_semi_intensive_unit_1=yes__0=no': 'Patient_admitted_to_semi_intensive_unit_1=yes__0=no', 'Patient_addmited_to_intensive_care_unit_1=yes__0=no': 'Patient_admitted_to_intensive_care_unit_1=yes__0=no'}, inplace=True)
df
| Patient_ID | Patient_age_quantile | SARS_Cov_2_exam_result | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Platelets | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Basophils | Mean_corpuscular_hemoglobin_MCH | Eosinophils | Mean_corpuscular_volume_MCV | Monocytes | Red_blood_cell_distribution_width_RDW | Serum_Glucose | Respiratory_Syncytial_Virus | Influenza_A | Influenza_B | Parainfluenza_1 | CoronavirusNL63 | Rhinovirus_Enterovirus | Mycoplasma_pneumoniae | Coronavirus_HKU1 | Parainfluenza_3 | Chlamydophila_pneumoniae | Adenovirus | Parainfluenza_4 | Coronavirus229E | CoronavirusOC43 | Inf_A_H1N1_2009 | Bordetella_pertussis | Metapneumovirus | Parainfluenza_2 | Neutrophils | Urea | Proteina_C_reativa_mg_dL | Creatinine | Potassium | Sodium | Influenza_B__rapid_test | Influenza_A__rapid_test | Alanine_transaminase | Aspartate_transaminase | Gamma_glutamyltransferase | Total_Bilirubin | Direct_Bilirubin | Indirect_Bilirubin | Alkaline_phosphatase | Ionized_calcium | Strepto_A | Magnesium | pCO2_venous_blood_gas_analysis | Hb_saturation_venous_blood_gas_analysis | Base_excess_venous_blood_gas_analysis | pO2_venous_blood_gas_analysis | Fio2_venous_blood_gas_analysis | Total_CO2_venous_blood_gas_analysis | pH_venous_blood_gas_analysis | HCO3_venous_blood_gas_analysis | Rods_# | Segmented | Promyelocytes | Metamyelocytes | Myelocytes | Myeloblasts | Urine___Esterase | Urine___Aspect | Urine___pH | Urine___Hemoglobin | Urine___Bile_pigments | Urine___Ketone_Bodies | Urine___Nitrite | Urine___Density | Urine___Urobilinogen | Urine___Protein | Urine___Sugar | Urine___Leukocytes | Urine___Crystals | Urine___Red_blood_cells | Urine___Hyaline_cylinders | Urine___Granular_cylinders | Urine___Yeasts | Urine___Color | Partial_thromboplastin_time PTT | 
Relationship_Patient_Normal | International_normalized_ratio_INR | Lactic_Dehydrogenase | Prothrombin_time_PT__Activity | Vitamin_B12 | Creatine_phosphokinase CPK | Ferritin | Arterial_Lactic_Acid | Lipase_dosage | D_Dimer | Albumin | Hb_saturation_arterial_blood_gases | pCO2_arterial_blood_gas_analysis | Base_excess_arterial_blood_gas_analysis | pH_arterial_blood_gas_analysis | Total_CO2_arterial_blood_gas_analysis | HCO3_arterial_blood_gas_analysis | pO2_arterial_blood_gas_analysis | Arteiral_Fio2 | Phosphor | ctO2_arterial_blood_gas_analysis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44477f75e8169d2 | 13 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 126e9dd13932f68 | 17 | negative | 0 | 0 | 0 | 0.237 | -0.022 | -0.517 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.224 | -0.292 | 1.482 | 0.166 | 0.358 | -0.625 | -0.141 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | -0.619 | 1.198 | -0.148 | 2.090 | -0.306 | 0.863 | negative | negative | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | a46b4402a0e5696 | 8 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | f7d619a94f97c45 | 5 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | d9e41465789c2b5 | 15 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | detected | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5639 | ae66feb9e4dc3a0 | 3 | positive | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5640 | 517c2834024f3ea | 17 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5641 | 5c57d6037fe266d | 4 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5642 | c20c44766f28291 | 10 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | clear | 5 | absent | absent | absent | NaN | -0.339 | normal | absent | NaN | 29000 | Ausentes | -0.177 | absent | absent | absent | yellow | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5643 | 2697fdccbfeb7f7 | 19 | positive | 0 | 0 | 0 | 0.694 | 0.542 | -0.907 | -0.326 | 0.578 | -0.296 | -0.353 | -1.288 | -1.140 | -0.135 | -0.836 | 0.026 | 0.568 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.381 | 0.454 | -0.504 | -0.736 | -0.553 | -0.934 | NaN | NaN | -0.284 | 0.109 | -0.420 | -0.481 | -0.586 | -0.279 | -0.243 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.420 | NaN | NaN | -0.343 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5644 rows × 111 columns
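As a more compact alternative to spelling out each renamed column, the rename map can be built programmatically from any column containing the typo. A minimal sketch on a toy frame (the column names are from this dataset, the frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(columns=["Patient_addmited_to_regular_ward_1=yes__0=no",
                            "Hemoglobin"])

# Build a rename map for every column containing the misspelling
fixes = {c: c.replace("addmited", "admitted")
         for c in toy.columns if "addmited" in c}
toy = toy.rename(columns=fixes)
```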
# Drop the Patient_ID column in place (an identifier, not a predictor).
# Note: with inplace=True, drop() returns None, so the result must not be
# assigned to a new variable; df itself is modified.
df.drop('Patient_ID', axis=1, inplace=True)
#Deciding which columns to drop based on missing values
#Code to establish percentage of missing values per column
#isnull().mean() avoids hard-coding the row count (5644)
df_null_1=(df.isnull().mean()*100).sort_values(ascending= False).head(60)
df_null_1
Mycoplasma_pneumoniae                      100.000
Urine___Sugar                              100.000
Partial_thromboplastin_time PTT            100.000
Prothrombin_time_PT__Activity              100.000
D_Dimer                                    100.000
Fio2_venous_blood_gas_analysis              99.982
Urine___Nitrite                             99.982
Vitamin_B12                                 99.947
Lipase_dosage                               99.858
Albumin                                     99.770
Arteiral_Fio2                               99.646
Phosphor                                    99.646
Ferritin                                    99.592
Hb_saturation_arterial_blood_gases          99.522
pCO2_arterial_blood_gas_analysis            99.522
Base_excess_arterial_blood_gas_analysis     99.522
pH_arterial_blood_gas_analysis              99.522
Arterial_Lactic_Acid                        99.522
Total_CO2_arterial_blood_gas_analysis       99.522
HCO3_arterial_blood_gas_analysis            99.522
pO2_arterial_blood_gas_analysis             99.522
ctO2_arterial_blood_gas_analysis            99.522
Magnesium                                   99.291
Ionized_calcium                             99.114
Urine___Ketone_Bodies                       98.990
Urine___Protein                             98.937
Urine___Esterase                            98.937
Urine___Hyaline_cylinders                   98.813
Urine___Granular_cylinders                  98.777
Urine___Urobilinogen                        98.777
Urine___pH                                  98.760
Urine___Hemoglobin                          98.760
Urine___Bile_pigments                       98.760
Urine___Color                               98.760
Urine___Density                             98.760
Urine___Leukocytes                          98.760
Urine___Crystals                            98.760
Urine___Red_blood_cells                     98.760
Urine___Yeasts                              98.760
Urine___Aspect                              98.760
Relationship_Patient_Normal                 98.388
Myeloblasts                                 98.281
Myelocytes                                  98.281
Metamyelocytes                              98.281
Segmented                                   98.281
Rods_#                                      98.281
Promyelocytes                               98.281
Lactic_Dehydrogenase                        98.210
Creatine_phosphokinase CPK                  98.157
International_normalized_ratio_INR          97.644
pCO2_venous_blood_gas_analysis              97.590
Base_excess_venous_blood_gas_analysis       97.590
HCO3_venous_blood_gas_analysis              97.590
pH_venous_blood_gas_analysis                97.590
Total_CO2_venous_blood_gas_analysis         97.590
pO2_venous_blood_gas_analysis               97.590
Hb_saturation_venous_blood_gas_analysis     97.590
Alkaline_phosphatase                        97.449
Gamma_glutamyltransferase                   97.289
Indirect_Bilirubin                          96.775
dtype: float64
#Code to determine columns with the fewest missing values
df_null_2=(df.isnull().mean()*100).sort_values(ascending= True).head(51)
df_null_2
Patient_age_quantile                                    0.000
SARS_Cov_2_exam_result                                  0.000
Patient_admitted_to_regular_ward_1=yes__0=no            0.000
Patient_admitted_to_semi_intensive_unit_1=yes__0=no     0.000
Patient_admitted_to_intensive_care_unit_1=yes__0=no     0.000
Influenza_B                                            76.010
Respiratory_Syncytial_Virus                            76.010
Influenza_A                                            76.010
Rhinovirus_Enterovirus                                 76.045
Inf_A_H1N1_2009                                        76.045
CoronavirusOC43                                        76.045
Coronavirus229E                                        76.045
Parainfluenza_4                                        76.045
Adenovirus                                             76.045
Chlamydophila_pneumoniae                               76.045
Parainfluenza_3                                        76.045
Coronavirus_HKU1                                       76.045
CoronavirusNL63                                        76.045
Parainfluenza_1                                        76.045
Bordetella_pertussis                                   76.045
Parainfluenza_2                                        76.045
Metapneumovirus                                        76.045
Influenza_A__rapid_test                                85.471
Influenza_B__rapid_test                                85.471
Hemoglobin                                             89.316
Hematocrit                                             89.316
Red_blood_cell_distribution_width_RDW                  89.334
Platelets                                              89.334
Mean_corpuscular_volume_MCV                            89.334
Eosinophils                                            89.334
Mean_corpuscular_hemoglobin_MCH                        89.334
Basophils                                              89.334
Leukocytes                                             89.334
Mean_corpuscular_hemoglobin_concentration MCHC         89.334
Lymphocytes                                            89.334
Red_blood_Cells                                        89.334
Monocytes                                              89.352
Mean_platelet_volume                                   89.387
Neutrophils                                            90.911
Proteina_C_reativa_mg_dL                               91.035
Creatinine                                             92.488
Urea                                                   92.966
Potassium                                              93.427
Sodium                                                 93.444
Strepto_A                                              94.118
Aspartate_transaminase                                 95.996
Alanine_transaminase                                   96.013
Serum_Glucose                                          96.315
Total_Bilirubin                                        96.775
Direct_Bilirubin                                       96.775
Indirect_Bilirubin                                     96.775
dtype: float64
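The missing-value audit above can be sketched on a toy frame (hypothetical data, not from this dataset). Using `isnull().mean()` rather than dividing by a hard-coded row count generalizes to any frame:

```python
import numpy as np
import pandas as pd

# Toy frame: column "a" is 75% missing, column "b" is 25% missing
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan, np.nan],
                    "b": [1.0, 2.0, 3.0, np.nan]})

# Per-column missing percentage, highest first
pct_missing = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(pct_missing)
# a    75.0
# b    25.0
```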
#Dropping columns with more than 90% missing values
#(.copy() ensures later assignments modify an independent frame, not a view)
df_clean1=df.loc[:,df.isnull().mean()<0.90].copy()
df_clean1
| Patient_age_quantile | SARS_Cov_2_exam_result | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Platelets | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Basophils | Mean_corpuscular_hemoglobin_MCH | Eosinophils | Mean_corpuscular_volume_MCV | Monocytes | Red_blood_cell_distribution_width_RDW | Respiratory_Syncytial_Virus | Influenza_A | Influenza_B | Parainfluenza_1 | CoronavirusNL63 | Rhinovirus_Enterovirus | Coronavirus_HKU1 | Parainfluenza_3 | Chlamydophila_pneumoniae | Adenovirus | Parainfluenza_4 | Coronavirus229E | CoronavirusOC43 | Inf_A_H1N1_2009 | Bordetella_pertussis | Metapneumovirus | Parainfluenza_2 | Influenza_B__rapid_test | Influenza_A__rapid_test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 17 | negative | 0 | 0 | 0 | 0.237 | -0.022 | -0.517 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.224 | -0.292 | 1.482 | 0.166 | 0.358 | -0.625 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | negative | negative |
| 2 | 8 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 5 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 15 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5639 | 3 | positive | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5640 | 17 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5641 | 4 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5642 | 10 | negative | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5643 | 19 | positive | 0 | 0 | 0 | 0.694 | 0.542 | -0.907 | -0.326 | 0.578 | -0.296 | -0.353 | -1.288 | -1.140 | -0.135 | -0.836 | 0.026 | 0.568 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5644 rows × 38 columns
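The threshold filter above keeps only columns whose missing-value fraction is below 0.90. A minimal sketch on a toy frame (hypothetical data):

```python
import numpy as np
import pandas as pd

# "mostly_missing" is 90% NaN (dropped), "mostly_present" is 10% NaN (kept)
toy = pd.DataFrame({"mostly_missing": [np.nan] * 9 + [1.0],
                    "mostly_present": [1.0] * 9 + [np.nan]})

# Boolean column mask: True where the missing fraction is under the threshold
kept = toy.loc[:, toy.isnull().mean() < 0.90]
print(list(kept.columns))
# -> ['mostly_present']
```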
#Code to fill missing values in numeric columns with each column's median
numeric_columns = df_clean1.select_dtypes(include=np.number).columns
df_clean1[numeric_columns]=df_clean1[numeric_columns].fillna(df_clean1[numeric_columns].median())
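The imputation step above can be sketched on a toy frame (hypothetical data): only numeric columns are selected, and each is filled with its own median, leaving non-numeric columns untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 3.0, np.nan],          # median of [1, 3] is 2.0
                    "label": ["a", "b", None]})        # non-numeric, left as-is

# Select numeric columns and fill their NaNs with the per-column median
num_cols = toy.select_dtypes(include=np.number).columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
print(toy["x"].tolist())
# -> [1.0, 3.0, 2.0]
```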
df_clean1
| Patient_age_quantile | SARS_Cov_2_exam_result | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Platelets | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Basophils | Mean_corpuscular_hemoglobin_MCH | Eosinophils | Mean_corpuscular_volume_MCV | Monocytes | Red_blood_cell_distribution_width_RDW | Respiratory_Syncytial_Virus | Influenza_A | Influenza_B | Parainfluenza_1 | CoronavirusNL63 | Rhinovirus_Enterovirus | Coronavirus_HKU1 | Parainfluenza_3 | Chlamydophila_pneumoniae | Adenovirus | Parainfluenza_4 | Coronavirus229E | CoronavirusOC43 | Inf_A_H1N1_2009 | Bordetella_pertussis | Metapneumovirus | Parainfluenza_2 | Influenza_B__rapid_test | Influenza_A__rapid_test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 17 | negative | 0 | 0 | 0 | 0.237 | -0.022 | -0.517 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.224 | -0.292 | 1.482 | 0.166 | 0.358 | -0.625 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | negative | negative |
| 2 | 8 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 5 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 15 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5639 | 3 | positive | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5640 | 17 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5641 | 4 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5642 | 10 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5643 | 19 | positive | 0 | 0 | 0 | 0.694 | 0.542 | -0.907 | -0.326 | 0.578 | -0.296 | -0.353 | -1.288 | -1.140 | -0.135 | -0.836 | 0.026 | 0.568 | -0.183 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5644 rows × 38 columns
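The median is used here rather than the mean because laboratory values are skewed and contain extreme readings, which would pull a mean-based fill away from typical values. A minimal sketch on a toy frame (column names illustrative, not the notebook's actual data):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values; 50.0 is a deliberate outlier.
toy = pd.DataFrame({
    "Hemoglobin": [0.1, np.nan, -0.3, 50.0],
    "Leukocytes": [np.nan, 0.2, 0.4, 0.6],
})

# Same pattern as above: select numeric columns, fill with per-column medians.
numeric_cols = toy.select_dtypes(include=np.number).columns
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

print(filled["Hemoglobin"].iloc[1])  # 0.1 -- a mean-based fill would give ~16.6
```

Filling with a single global statistic does flatten the column's variance, so this is a pragmatic choice rather than the only option.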
# Fill missing values in the viral-test categorical columns with an "Unknown" class
cat_cols= ['Respiratory_Syncytial_Virus','Influenza_A','Influenza_B','Parainfluenza_1',
'CoronavirusNL63','Rhinovirus_Enterovirus','Coronavirus_HKU1','Parainfluenza_3',
'Chlamydophila_pneumoniae','Adenovirus','Parainfluenza_4','Coronavirus229E',
'CoronavirusOC43', 'Inf_A_H1N1_2009', 'Bordetella_pertussis','Metapneumovirus',
'Parainfluenza_2','Influenza_B__rapid_test','Influenza_A__rapid_test'
]
df_clean1[cat_cols] = df_clean1[cat_cols].fillna("Unknown")
df_clean1
| Patient_age_quantile | SARS_Cov_2_exam_result | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Platelets | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Basophils | Mean_corpuscular_hemoglobin_MCH | Eosinophils | Mean_corpuscular_volume_MCV | Monocytes | Red_blood_cell_distribution_width_RDW | Respiratory_Syncytial_Virus | Influenza_A | Influenza_B | Parainfluenza_1 | CoronavirusNL63 | Rhinovirus_Enterovirus | Coronavirus_HKU1 | Parainfluenza_3 | Chlamydophila_pneumoniae | Adenovirus | Parainfluenza_4 | Coronavirus229E | CoronavirusOC43 | Inf_A_H1N1_2009 | Bordetella_pertussis | Metapneumovirus | Parainfluenza_2 | Influenza_B__rapid_test | Influenza_A__rapid_test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 1 | 17 | negative | 0 | 0 | 0 | 0.237 | -0.022 | -0.517 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.224 | -0.292 | 1.482 | 0.166 | 0.358 | -0.625 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | negative | negative |
| 2 | 8 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 3 | 5 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 4 | 15 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | Unknown | Unknown |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5639 | 3 | positive | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 5640 | 17 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 5641 | 4 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 5642 | 10 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.122 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | -0.224 | 0.126 | -0.330 | 0.066 | -0.115 | -0.183 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 5643 | 19 | positive | 0 | 0 | 0 | 0.694 | 0.542 | -0.907 | -0.326 | 0.578 | -0.296 | -0.353 | -1.288 | -1.140 | -0.135 | -0.836 | 0.026 | 0.568 | -0.183 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
5644 rows × 38 columns
df_clean1.isna().sum()
Patient_age_quantile 0 SARS_Cov_2_exam_result 0 Patient_admitted_to_regular_ward_1=yes__0=no 0 Patient_admitted_to_semi_intensive_unit_1=yes__0=no 0 Patient_admitted_to_intensive_care_unit_1=yes__0=no 0 Hematocrit 0 Hemoglobin 0 Platelets 0 Mean_platelet_volume 0 Red_blood_Cells 0 Lymphocytes 0 Mean_corpuscular_hemoglobin_concentration MCHC 0 Leukocytes 0 Basophils 0 Mean_corpuscular_hemoglobin_MCH 0 Eosinophils 0 Mean_corpuscular_volume_MCV 0 Monocytes 0 Red_blood_cell_distribution_width_RDW 0 Respiratory_Syncytial_Virus 0 Influenza_A 0 Influenza_B 0 Parainfluenza_1 0 CoronavirusNL63 0 Rhinovirus_Enterovirus 0 Coronavirus_HKU1 0 Parainfluenza_3 0 Chlamydophila_pneumoniae 0 Adenovirus 0 Parainfluenza_4 0 Coronavirus229E 0 CoronavirusOC43 0 Inf_A_H1N1_2009 0 Bordetella_pertussis 0 Metapneumovirus 0 Parainfluenza_2 0 Influenza_B__rapid_test 0 Influenza_A__rapid_test 0 dtype: int64
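The categorical fill works the same way: replacing NaN with an explicit "Unknown" level keeps the row while recording that the test was never performed, so "Unknown" becomes a meaningful third state alongside "detected" and "not_detected". A toy example (hypothetical values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Influenza_A": ["detected", np.nan, "not_detected", np.nan],
    "Adenovirus":  [np.nan, "not_detected", np.nan, np.nan],
})

# Fill missing test results with an explicit "Unknown" class.
filled = toy.fillna("Unknown")

print(filled["Influenza_A"].tolist())
# ['detected', 'Unknown', 'not_detected', 'Unknown']
```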
# outlier detection using boxplot
plt.figure(figsize=(30, 100))
for i, variable in enumerate(numeric_columns):
plt.subplot(20, 3, i + 1)
plt.boxplot(df_clean1[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# functions to treat outliers by flooring and capping
def treat_outliers(df, col):
"""
Treats outliers in a variable
df: dataframe
col: dataframe column
"""
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
return df
def treat_outliers_all(df, col_list):
"""
Treat outliers in a list of variables
df: dataframe
col_list: list of dataframe columns
"""
for c in col_list:
df = treat_outliers(df, c)
return df
treat_out_cols = [
'Platelets',
'Basophils',
'Eosinophils',
'Monocytes',
'Red_blood_cell_distribution_width_RDW'
]
df_clean2 = treat_outliers_all(df_clean1, treat_out_cols)
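The flooring-and-capping step uses Tukey's 1.5×IQR fences; its effect can be seen on a small standalone series (same formula as `treat_outliers` above, toy data):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.5, 3.0, 100.0])  # 100.0 is an extreme outlier

# Tukey fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)     # 2.0 and 3.0
iqr = q3 - q1                                   # 1.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 0.5 and 4.5

capped = s.clip(lower, upper)  # pd.Series.clip behaves like np.clip here
print(capped.tolist())          # [1.0, 2.0, 2.5, 3.0, 4.5]
```

Values inside the fences pass through unchanged; only the extreme reading is pulled down to the upper fence.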
# Checking to see result of outlier treatment
plt.figure(figsize=(9, 15))
for i, variable in enumerate(treat_out_cols):
plt.subplot(3, 4, i + 1)
plt.boxplot(df_clean2[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Outliers in the selected columns have been floored and capped.
df_clean2.nunique()
Patient_age_quantile 20 SARS_Cov_2_exam_result 2 Patient_admitted_to_regular_ward_1=yes__0=no 2 Patient_admitted_to_semi_intensive_unit_1=yes__0=no 2 Patient_admitted_to_intensive_care_unit_1=yes__0=no 2 Hematocrit 176 Hemoglobin 84 Platelets 1 Mean_platelet_volume 48 Red_blood_Cells 211 Lymphocytes 318 Mean_corpuscular_hemoglobin_concentration MCHC 57 Leukocytes 476 Basophils 1 Mean_corpuscular_hemoglobin_MCH 91 Eosinophils 1 Mean_corpuscular_volume_MCV 190 Monocytes 1 Red_blood_cell_distribution_width_RDW 1 Respiratory_Syncytial_Virus 3 Influenza_A 3 Influenza_B 3 Parainfluenza_1 3 CoronavirusNL63 3 Rhinovirus_Enterovirus 3 Coronavirus_HKU1 3 Parainfluenza_3 3 Chlamydophila_pneumoniae 3 Adenovirus 3 Parainfluenza_4 3 Coronavirus229E 3 CoronavirusOC43 3 Inf_A_H1N1_2009 3 Bordetella_pertussis 3 Metapneumovirus 3 Parainfluenza_2 2 Influenza_B__rapid_test 3 Influenza_A__rapid_test 3 dtype: int64
After outlier treatment, five columns contain only a single value: Platelets, Basophils, Eosinophils, Monocytes, and Red_blood_cell_distribution_width_RDW.
A constant column carries no information for modeling, so these five columns will be dropped.
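Rather than reading the `nunique()` output by eye, constant columns can also be found programmatically; a small sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Platelets":  [0.0, 0.0, 0.0],   # constant after capping
    "Hemoglobin": [0.1, -0.2, 0.3],
})

# Columns with at most one distinct value carry no information.
constant_cols = toy.columns[toy.nunique() <= 1].tolist()
reduced = toy.drop(columns=constant_cols)

print(constant_cols)  # ['Platelets']
```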
df_clean3 = df_clean2.drop(
["Red_blood_cell_distribution_width_RDW", "Monocytes", "Basophils", "Eosinophils", "Platelets"], axis=1
)
df_clean3.head()
| Patient_age_quantile | SARS_Cov_2_exam_result | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Mean_corpuscular_hemoglobin_MCH | Mean_corpuscular_volume_MCV | Respiratory_Syncytial_Virus | Influenza_A | Influenza_B | Parainfluenza_1 | CoronavirusNL63 | Rhinovirus_Enterovirus | Coronavirus_HKU1 | Parainfluenza_3 | Chlamydophila_pneumoniae | Adenovirus | Parainfluenza_4 | Coronavirus229E | CoronavirusOC43 | Inf_A_H1N1_2009 | Bordetella_pertussis | Metapneumovirus | Parainfluenza_2 | Influenza_B__rapid_test | Influenza_A__rapid_test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 1 | 17 | negative | 0 | 0 | 0 | 0.237 | -0.022 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.292 | 0.166 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | negative | negative |
| 2 | 8 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 3 | 5 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 4 | 15 | negative | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | Unknown | Unknown |
# Replace entries in 'SARS_Cov_2_exam_result' with zeros and ones
replaceStruct = {"SARS_Cov_2_exam_result": {"negative": 0, "positive": 1}}
df_model = df_clean3.replace(replaceStruct)
df_model.head()
| Patient_age_quantile | SARS_Cov_2_exam_result | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Mean_corpuscular_hemoglobin_MCH | Mean_corpuscular_volume_MCV | Respiratory_Syncytial_Virus | Influenza_A | Influenza_B | Parainfluenza_1 | CoronavirusNL63 | Rhinovirus_Enterovirus | Coronavirus_HKU1 | Parainfluenza_3 | Chlamydophila_pneumoniae | Adenovirus | Parainfluenza_4 | Coronavirus229E | CoronavirusOC43 | Inf_A_H1N1_2009 | Bordetella_pertussis | Metapneumovirus | Parainfluenza_2 | Influenza_B__rapid_test | Influenza_A__rapid_test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 1 | 17 | 0 | 0 | 0 | 0 | 0.237 | -0.022 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.292 | 0.166 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | negative | negative |
| 2 | 8 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 3 | 5 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 4 | 15 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | Unknown | Unknown |
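The nested-dict `replace` pattern maps label strings to integers one column at a time (the outer key selects the column, the inner dict maps its values); `Series.map` would be an equivalent alternative. A minimal sketch:

```python
import pandas as pd

toy = pd.DataFrame({"SARS_Cov_2_exam_result": ["negative", "positive", "negative"]})

# Outer key selects the column, inner dict maps its values.
encoded = toy.replace({"SARS_Cov_2_exam_result": {"negative": 0, "positive": 1}})

print(encoded["SARS_Cov_2_exam_result"].tolist())  # [0, 1, 0]
```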
Values in the 'SARS_Cov_2_exam_result' column have been replaced with zeros (negative) and ones (positive).
df_model[cat_cols]=df_model[cat_cols].astype('category')
Object data types for the viral-test columns have been converted to categorical.
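Converting repetitive object columns to the `category` dtype stores each value as a small integer code plus one copy of each label, which cuts memory use considerably; a quick sketch (toy sizes):

```python
import pandas as pd

# Object series with many repeats of a few labels.
s_obj = pd.Series(["Unknown"] * 10_000 + ["not_detected"] * 50, dtype="object")
s_cat = s_obj.astype("category")

print(str(s_cat.dtype))                                               # category
print(s_cat.memory_usage(deep=True) < s_obj.memory_usage(deep=True))  # True
```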
df_model.head()
| Patient_age_quantile | SARS_Cov_2_exam_result | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Mean_corpuscular_hemoglobin_MCH | Mean_corpuscular_volume_MCV | Respiratory_Syncytial_Virus | Influenza_A | Influenza_B | Parainfluenza_1 | CoronavirusNL63 | Rhinovirus_Enterovirus | Coronavirus_HKU1 | Parainfluenza_3 | Chlamydophila_pneumoniae | Adenovirus | Parainfluenza_4 | Coronavirus229E | CoronavirusOC43 | Inf_A_H1N1_2009 | Bordetella_pertussis | Metapneumovirus | Parainfluenza_2 | Influenza_B__rapid_test | Influenza_A__rapid_test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 1 | 17 | 0 | 0 | 0 | 0 | 0.237 | -0.022 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.292 | 0.166 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | negative | negative |
| 2 | 8 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 3 | 5 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 4 | 15 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | not_detected | not_detected | not_detected | not_detected | not_detected | detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | not_detected | Unknown | Unknown |
Column names have been revised from those of the original dataset.
# viewing a random sample of the dataset
df_model.sample(n=10, random_state=1)
| Patient_age_quantile | SARS_Cov_2_exam_result | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Mean_corpuscular_hemoglobin_MCH | Mean_corpuscular_volume_MCV | Respiratory_Syncytial_Virus | Influenza_A | Influenza_B | Parainfluenza_1 | CoronavirusNL63 | Rhinovirus_Enterovirus | Coronavirus_HKU1 | Parainfluenza_3 | Chlamydophila_pneumoniae | Adenovirus | Parainfluenza_4 | Coronavirus229E | CoronavirusOC43 | Inf_A_H1N1_2009 | Bordetella_pertussis | Metapneumovirus | Parainfluenza_2 | Influenza_B__rapid_test | Influenza_A__rapid_test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4441 | 12 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 1603 | 1 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 1206 | 10 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 1586 | 6 | 1 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 2730 | 16 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 3205 | 9 | 0 | 0 | 0 | 0 | 0.191 | 0.228 | -0.438 | 0.031 | 1.461 | 0.244 | 0.573 | 0.283 | 0.226 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | negative | negative |
| 5321 | 10 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 943 | 17 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 5029 | 10 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
| 1998 | 1 | 0 | 0 | 0 | 0 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown |
A random sample of 10 cases is shown to evaluate the changes.
df_model.shape
(5644, 33)
There are 5644 rows and 33 columns.
df_model.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5644 entries, 0 to 5643 Data columns (total 33 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Patient_age_quantile 5644 non-null int64 1 SARS_Cov_2_exam_result 5644 non-null int64 2 Patient_admitted_to_regular_ward_1=yes__0=no 5644 non-null int64 3 Patient_admitted_to_semi_intensive_unit_1=yes__0=no 5644 non-null int64 4 Patient_admitted_to_intensive_care_unit_1=yes__0=no 5644 non-null int64 5 Hematocrit 5644 non-null float64 6 Hemoglobin 5644 non-null float64 7 Mean_platelet_volume 5644 non-null float64 8 Red_blood_Cells 5644 non-null float64 9 Lymphocytes 5644 non-null float64 10 Mean_corpuscular_hemoglobin_concentration MCHC 5644 non-null float64 11 Leukocytes 5644 non-null float64 12 Mean_corpuscular_hemoglobin_MCH 5644 non-null float64 13 Mean_corpuscular_volume_MCV 5644 non-null float64 14 Respiratory_Syncytial_Virus 5644 non-null category 15 Influenza_A 5644 non-null category 16 Influenza_B 5644 non-null category 17 Parainfluenza_1 5644 non-null category 18 CoronavirusNL63 5644 non-null category 19 Rhinovirus_Enterovirus 5644 non-null category 20 Coronavirus_HKU1 5644 non-null category 21 Parainfluenza_3 5644 non-null category 22 Chlamydophila_pneumoniae 5644 non-null category 23 Adenovirus 5644 non-null category 24 Parainfluenza_4 5644 non-null category 25 Coronavirus229E 5644 non-null category 26 CoronavirusOC43 5644 non-null category 27 Inf_A_H1N1_2009 5644 non-null category 28 Bordetella_pertussis 5644 non-null category 29 Metapneumovirus 5644 non-null category 30 Parainfluenza_2 5644 non-null category 31 Influenza_B__rapid_test 5644 non-null category 32 Influenza_A__rapid_test 5644 non-null category dtypes: category(19), float64(9), int64(5) memory usage: 724.6 KB
df_model.nunique()
Patient_age_quantile 20 SARS_Cov_2_exam_result 2 Patient_admitted_to_regular_ward_1=yes__0=no 2 Patient_admitted_to_semi_intensive_unit_1=yes__0=no 2 Patient_admitted_to_intensive_care_unit_1=yes__0=no 2 Hematocrit 176 Hemoglobin 84 Mean_platelet_volume 48 Red_blood_Cells 211 Lymphocytes 318 Mean_corpuscular_hemoglobin_concentration MCHC 57 Leukocytes 476 Mean_corpuscular_hemoglobin_MCH 91 Mean_corpuscular_volume_MCV 190 Respiratory_Syncytial_Virus 3 Influenza_A 3 Influenza_B 3 Parainfluenza_1 3 CoronavirusNL63 3 Rhinovirus_Enterovirus 3 Coronavirus_HKU1 3 Parainfluenza_3 3 Chlamydophila_pneumoniae 3 Adenovirus 3 Parainfluenza_4 3 Coronavirus229E 3 CoronavirusOC43 3 Inf_A_H1N1_2009 3 Bordetella_pertussis 3 Metapneumovirus 3 Parainfluenza_2 2 Influenza_B__rapid_test 3 Influenza_A__rapid_test 3 dtype: int64
Summary of unique values per column
df_model.isna().sum()
Patient_age_quantile 0 SARS_Cov_2_exam_result 0 Patient_admitted_to_regular_ward_1=yes__0=no 0 Patient_admitted_to_semi_intensive_unit_1=yes__0=no 0 Patient_admitted_to_intensive_care_unit_1=yes__0=no 0 Hematocrit 0 Hemoglobin 0 Mean_platelet_volume 0 Red_blood_Cells 0 Lymphocytes 0 Mean_corpuscular_hemoglobin_concentration MCHC 0 Leukocytes 0 Mean_corpuscular_hemoglobin_MCH 0 Mean_corpuscular_volume_MCV 0 Respiratory_Syncytial_Virus 0 Influenza_A 0 Influenza_B 0 Parainfluenza_1 0 CoronavirusNL63 0 Rhinovirus_Enterovirus 0 Coronavirus_HKU1 0 Parainfluenza_3 0 Chlamydophila_pneumoniae 0 Adenovirus 0 Parainfluenza_4 0 Coronavirus229E 0 CoronavirusOC43 0 Inf_A_H1N1_2009 0 Bordetella_pertussis 0 Metapneumovirus 0 Parainfluenza_2 0 Influenza_B__rapid_test 0 Influenza_A__rapid_test 0 dtype: int64
There are no missing values
df_model.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Patient_age_quantile | 5644.000 | 9.318 | 5.778 | 0.000 | 4.000 | 9.000 | 14.000 | 19.000 |
| SARS_Cov_2_exam_result | 5644.000 | 0.099 | 0.299 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Patient_admitted_to_regular_ward_1=yes__0=no | 5644.000 | 0.014 | 0.117 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Patient_admitted_to_semi_intensive_unit_1=yes__0=no | 5644.000 | 0.009 | 0.094 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Patient_admitted_to_intensive_care_unit_1=yes__0=no | 5644.000 | 0.007 | 0.085 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Hematocrit | 5644.000 | 0.048 | 0.327 | -4.501 | 0.053 | 0.053 | 0.053 | 2.663 |
| Hemoglobin | 5644.000 | 0.036 | 0.327 | -4.346 | 0.040 | 0.040 | 0.040 | 2.672 |
| Mean_platelet_volume | 5644.000 | -0.091 | 0.327 | -2.458 | -0.102 | -0.102 | -0.102 | 3.713 |
| Red_blood_Cells | 5644.000 | 0.012 | 0.327 | -3.971 | 0.014 | 0.014 | 0.014 | 3.646 |
| Lymphocytes | 5644.000 | -0.013 | 0.327 | -1.865 | -0.014 | -0.014 | -0.014 | 3.764 |
| Mean_corpuscular_hemoglobin_concentration MCHC | 5644.000 | -0.049 | 0.327 | -5.432 | -0.055 | -0.055 | -0.055 | 3.331 |
| Leukocytes | 5644.000 | -0.190 | 0.333 | -2.020 | -0.213 | -0.213 | -0.213 | 4.522 |
| Mean_corpuscular_hemoglobin_MCH | 5644.000 | 0.112 | 0.329 | -5.938 | 0.126 | 0.126 | 0.126 | 4.099 |
| Mean_corpuscular_volume_MCV | 5644.000 | 0.059 | 0.327 | -5.102 | 0.066 | 0.066 | 0.066 | 3.411 |
The statistical summary now provides a more comprehensive view of the dataset.
# Correlation matrix of the numeric columns
corr2 = df_model.corr(numeric_only=True)
corr2
| Patient_age_quantile | SARS_Cov_2_exam_result | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Mean_corpuscular_hemoglobin_MCH | Mean_corpuscular_volume_MCV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Patient_age_quantile | 1.000 | 0.075 | 0.046 | 0.016 | -0.036 | 0.026 | 0.015 | 0.049 | -0.014 | -0.039 | -0.034 | -0.031 | 0.050 | 0.084 |
| SARS_Cov_2_exam_result | 0.075 | 1.000 | 0.142 | 0.019 | 0.028 | 0.035 | 0.038 | 0.044 | 0.045 | -0.005 | 0.020 | -0.098 | -0.016 | -0.024 |
| Patient_admitted_to_regular_ward_1=yes__0=no | 0.046 | 0.142 | 1.000 | -0.011 | -0.010 | -0.084 | -0.085 | 0.012 | -0.047 | -0.075 | -0.016 | -0.035 | -0.070 | -0.047 |
| Patient_admitted_to_semi_intensive_unit_1=yes__0=no | 0.016 | 0.019 | -0.011 | 1.000 | -0.008 | -0.173 | -0.166 | 0.001 | -0.125 | -0.095 | -0.009 | 0.165 | -0.075 | -0.059 |
| Patient_admitted_to_intensive_care_unit_1=yes__0=no | -0.036 | 0.028 | -0.010 | -0.008 | 1.000 | -0.160 | -0.154 | -0.044 | -0.102 | -0.088 | -0.021 | 0.252 | -0.094 | -0.075 |
| Hematocrit | 0.026 | 0.035 | -0.084 | -0.173 | -0.160 | 1.000 | 0.968 | 0.078 | 0.872 | 0.001 | 0.128 | -0.098 | 0.081 | 0.028 |
| Hemoglobin | 0.015 | 0.038 | -0.085 | -0.166 | -0.154 | 0.968 | 1.000 | 0.075 | 0.841 | -0.005 | 0.369 | -0.108 | 0.188 | 0.030 |
| Mean_platelet_volume | 0.049 | 0.044 | 0.012 | 0.001 | -0.044 | 0.078 | 0.075 | 1.000 | 0.040 | 0.080 | 0.002 | -0.131 | 0.056 | 0.070 |
| Red_blood_Cells | -0.014 | 0.045 | -0.047 | -0.125 | -0.102 | 0.872 | 0.841 | 0.040 | 1.000 | -0.010 | 0.089 | -0.038 | -0.363 | -0.457 |
| Lymphocytes | -0.039 | -0.005 | -0.075 | -0.095 | -0.088 | 0.001 | -0.005 | 0.080 | -0.010 | 1.000 | -0.027 | -0.321 | 0.013 | 0.026 |
| Mean_corpuscular_hemoglobin_concentration MCHC | -0.034 | 0.020 | -0.016 | -0.009 | -0.021 | 0.128 | 0.369 | 0.002 | 0.089 | -0.027 | 1.000 | -0.055 | 0.464 | 0.032 |
| Leukocytes | -0.031 | -0.098 | -0.035 | 0.165 | 0.252 | -0.098 | -0.108 | -0.131 | -0.038 | -0.321 | -0.055 | 1.000 | -0.144 | -0.113 |
| Mean_corpuscular_hemoglobin_MCH | 0.050 | -0.016 | -0.070 | -0.075 | -0.094 | 0.081 | 0.188 | 0.056 | -0.363 | 0.013 | 0.464 | -0.144 | 1.000 | 0.895 |
| Mean_corpuscular_volume_MCV | 0.084 | -0.024 | -0.047 | -0.059 | -0.075 | 0.028 | 0.030 | 0.070 | -0.457 | 0.026 | 0.032 | -0.113 | 0.895 | 1.000 |
# Code for pairplots
sns.pairplot(df_model)
plt.show()
# Code for heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(corr2, annot=True, cmap="Spectral")
plt.show()
There is a strong positive correlation between the following variable pairs:
- Hematocrit and Hemoglobin (0.968)
- Hematocrit and Red_blood_Cells (0.872)
- Hemoglobin and Red_blood_Cells (0.841)
- Mean_corpuscular_hemoglobin_MCH and Mean_corpuscular_volume_MCV (0.895)
labeled_barplot(df_model,'SARS_Cov_2_exam_result', perc= True)
plt.show()
df_model['SARS_Cov_2_exam_result'].value_counts()
0    5086
1     558
Name: SARS_Cov_2_exam_result, dtype: int64
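With only 558 positives out of 5,644 records (roughly 9.9%), the classes are heavily imbalanced, which is why recall rather than accuracy is used to compare models later on. A quick sketch quantifying the imbalance:

```python
import pandas as pd

# Rebuilding the value counts printed above to quantify the class imbalance
counts = pd.Series({0: 5086, 1: 558}, name="SARS_Cov_2_exam_result")
positive_rate = counts[1] / counts.sum()
print(f"{positive_rate:.1%} of suspected cases tested positive")
```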
Since the target variable is binary, analytical approaches suited to this problem include logistic regression and tree-based classifiers, along with ensemble methods such as bagging, random forests, gradient boosting, and AdaBoost.
# Separating features and the target column
X = df_model.drop('SARS_Cov_2_exam_result', axis=1)
Y = df_model['SARS_Cov_2_exam_result']
# creating dummy variables
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
# to ensure all variables are of float type
X = X.astype(float)
X.head()
| Patient_age_quantile | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Mean_corpuscular_hemoglobin_MCH | Mean_corpuscular_volume_MCV | Respiratory_Syncytial_Virus_detected | Respiratory_Syncytial_Virus_not_detected | Influenza_A_detected | Influenza_A_not_detected | Influenza_B_detected | Influenza_B_not_detected | Parainfluenza_1_detected | Parainfluenza_1_not_detected | CoronavirusNL63_detected | CoronavirusNL63_not_detected | Rhinovirus_Enterovirus_detected | Rhinovirus_Enterovirus_not_detected | Coronavirus_HKU1_detected | Coronavirus_HKU1_not_detected | Parainfluenza_3_detected | Parainfluenza_3_not_detected | Chlamydophila_pneumoniae_detected | Chlamydophila_pneumoniae_not_detected | Adenovirus_detected | Adenovirus_not_detected | Parainfluenza_4_detected | Parainfluenza_4_not_detected | Coronavirus229E_detected | Coronavirus229E_not_detected | CoronavirusOC43_detected | CoronavirusOC43_not_detected | Inf_A_H1N1_2009_detected | Inf_A_H1N1_2009_not_detected | Bordetella_pertussis_detected | Bordetella_pertussis_not_detected | Metapneumovirus_detected | Metapneumovirus_not_detected | Parainfluenza_2_not_detected | Influenza_B__rapid_test_negative | Influenza_B__rapid_test_positive | Influenza_A__rapid_test_negative | Influenza_A__rapid_test_positive | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.000 | 0.000 | 0.000 | 0.000 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 1 | 17.000 | 0.000 | 0.000 | 0.000 | 0.237 | -0.022 | 0.011 | 0.102 | 0.318 | -0.951 | -0.095 | -0.292 | 0.166 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.000 | 1.000 | 0.000 |
| 2 | 8.000 | 0.000 | 0.000 | 0.000 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 3 | 5.000 | 0.000 | 0.000 | 0.000 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 4 | 15.000 | 0.000 | 0.000 | 0.000 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 |
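One caveat with the dummy layout above: each viral test expands into a complementary `_detected`/`_not_detected` pair, and since the panel tests appear to have been run on the same subset of patients, the pair sums reproduce one shared "test performed" indicator. The dummy columns are therefore exactly linearly dependent, which is what later drives the infinite VIF values. A minimal sketch with hypothetical data illustrating the dependence:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dummy layout: two tests, each with a
# complementary detected/not_detected pair; all-zero rows mean "not tested"
X_demo = pd.DataFrame({
    "Influenza_A_detected":     [0, 0, 1, 0],
    "Influenza_A_not_detected": [0, 1, 0, 1],
    "Influenza_B_detected":     [0, 1, 0, 0],
    "Influenza_B_not_detected": [0, 0, 1, 1],
})

# Each pair sums to the same "test performed" indicator...
performed_A = X_demo["Influenza_A_detected"] + X_demo["Influenza_A_not_detected"]
performed_B = X_demo["Influenza_B_detected"] + X_demo["Influenza_B_not_detected"]

# ...so the design matrix is rank-deficient (3 independent columns out of 4)
rank = np.linalg.matrix_rank(X_demo.values)
print(performed_A.equals(performed_B), rank)
```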
# importing the split utility and statsmodels (used for the logit model below)
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

# adding a constant term for the statsmodels logit
X = sm.add_constant(X)
# splitting into a temporary and a test set (70:30), stratified on the target
X_temp, X_test, y_temp, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
# splitting the temporary set into train and validation sets, also stratified
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.35, random_state=1, stratify=y_temp
)
X_train.head()
| const | Patient_age_quantile | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Hematocrit | Hemoglobin | Mean_platelet_volume | Red_blood_Cells | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Mean_corpuscular_hemoglobin_MCH | Mean_corpuscular_volume_MCV | Respiratory_Syncytial_Virus_detected | Respiratory_Syncytial_Virus_not_detected | Influenza_A_detected | Influenza_A_not_detected | Influenza_B_detected | Influenza_B_not_detected | Parainfluenza_1_detected | Parainfluenza_1_not_detected | CoronavirusNL63_detected | CoronavirusNL63_not_detected | Rhinovirus_Enterovirus_detected | Rhinovirus_Enterovirus_not_detected | Coronavirus_HKU1_detected | Coronavirus_HKU1_not_detected | Parainfluenza_3_detected | Parainfluenza_3_not_detected | Chlamydophila_pneumoniae_detected | Chlamydophila_pneumoniae_not_detected | Adenovirus_detected | Adenovirus_not_detected | Parainfluenza_4_detected | Parainfluenza_4_not_detected | Coronavirus229E_detected | Coronavirus229E_not_detected | CoronavirusOC43_detected | CoronavirusOC43_not_detected | Inf_A_H1N1_2009_detected | Inf_A_H1N1_2009_not_detected | Bordetella_pertussis_detected | Bordetella_pertussis_not_detected | Metapneumovirus_detected | Metapneumovirus_not_detected | Parainfluenza_2_not_detected | Influenza_B__rapid_test_negative | Influenza_B__rapid_test_positive | Influenza_A__rapid_test_negative | Influenza_A__rapid_test_positive | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5316 | 1.000 | 10.000 | 0.000 | 0.000 | 0.000 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 5006 | 1.000 | 11.000 | 0.000 | 0.000 | 0.000 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2433 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
| 5437 | 1.000 | 0.000 | 0.000 | 0.000 | 1.000 | -2.121 | -2.341 | -0.775 | -0.973 | 0.387 | -1.747 | 0.821 | -2.697 | -2.217 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 5032 | 1.000 | 9.000 | 0.000 | 0.000 | 0.000 | 0.053 | 0.040 | -0.102 | 0.014 | -0.014 | -0.055 | -0.213 | 0.126 | 0.066 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
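Because positives are rare, the `stratify` arguments in the splits above matter: they keep the positive rate nearly identical across train, validation, and test sets. A self-contained sketch with synthetic data (hypothetical names) demonstrating this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with a 10% positive rate, similar to our target
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"x": rng.normal(size=1000)})
y_demo = pd.Series([1] * 100 + [0] * 900)

# Same two-stage stratified split as above: 70:30, then 65:35 of the 70%
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.35, random_state=1, stratify=y_tmp
)

# All three splits keep the positive rate at roughly 10%
print(round(y_tr.mean(), 3), round(y_va.mean(), 3), round(y_te.mean(), 3))
```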
# metrics used by the helper below
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# defining a function to compute different metrics to check the performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
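A self-contained usage sketch (synthetic data, hypothetical names; the helper body is repeated so the example runs on its own). Note that an unpruned decision tree scores perfectly on its own training data, which is why the training metrics further down look far better than the cross-validated ones:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

def model_performance_classification_sklearn(model, predictors, target):
    # same logic as the helper above, repeated so this sketch runs on its own
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": accuracy_score(target, pred),
            "Recall": recall_score(target, pred),
            "Precision": precision_score(target, pred),
            "F1": f1_score(target, pred),
        },
        index=[0],
    )

# Synthetic, perfectly separable data: label depends only on the first feature
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_demo = (X_demo["f1"] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
perf = model_performance_classification_sklearn(tree, X_demo, y_demo)
print(perf)  # an unpruned tree fits its own training data perfectly
```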
# imports for cross-validation and the first candidate model
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

models = []  # Empty list to store all the models
# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results1 = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Cost:" "\n")
for name, model in models:
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
)
results1.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean()))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores))
Cross-Validation Cost:

dtree: 0.08258823529411764

Validation Performance:

dtree: 0.06569343065693431
# imports for the full set of candidate models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

models = []  # Empty list to store all the models
# Appending models into the list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Training Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_train, model.predict(X_train)) * 100
print("{}: {}".format(name, scores))
Cross-Validation Performance:

Logistic Regression: 3.9294117647058826
Bagging: 8.658823529411766
Random forest: 5.52156862745098
GBM: 8.658823529411764
Adaboost: 7.482352941176471
dtree: 8.258823529411764

Training Performance:

Logistic Regression: 5.118110236220472
Bagging: 16.92913385826772
Random forest: 17.716535433070867
GBM: 16.535433070866144
Adaboost: 14.173228346456693
dtree: 16.92913385826772
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
    # flagging which predicted probabilities exceed the threshold
    pred_temp = model.predict(predictors) > threshold
    # converting the boolean flags to 0/1 class labels
    pred = pred_temp.astype(int)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
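The `threshold` parameter is the lever worth experimenting with here: with a rare positive class, lowering it below 0.5 classifies more observations as positive, raising recall at the cost of precision. A self-contained sketch with hypothetical predicted probabilities:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities for ten patients (not real model output)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
proba = np.array([0.10, 0.20, 0.30, 0.42, 0.45, 0.48, 0.60, 0.44, 0.55, 0.80])

# Default threshold of 0.5
pred_default = (proba > 0.5).astype(int)
# Lowered threshold: more observations classified as positive
pred_low = (proba > 0.4).astype(int)

recall_default = recall_score(y_true, pred_default)       # 2 of 3 positives found
recall_low = recall_score(y_true, pred_low)               # all 3 positives found
precision_default = precision_score(y_true, pred_default)
precision_low = precision_score(y_true, pred_low)
print(recall_default, recall_low, precision_default, precision_low)
```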
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# fitting the model on training set
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(
disp=False
) # setting disp=False will remove the information on number of iterations
print(lg.summary())
Logit Regression Results
==================================================================================
Dep. Variable: SARS_Cov_2_exam_result No. Observations: 2567
Model: Logit Df Residuals: 2533
Method: MLE Df Model: 33
Date: Sat, 11 Feb 2023 Pseudo R-squ.: 0.09229
Time: 01:00:10 Log-Likelihood: -752.07
converged: False LL-Null: -828.54
Covariance Type: nonrobust LLR p-value: 2.324e-17
=======================================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------------
const -2.9542 0.340 -8.694 0.000 -3.620 -2.288
Patient_age_quantile 0.0368 0.013 2.935 0.003 0.012 0.061
Patient_admitted_to_regular_ward_1=yes__0=no 2.2675 0.484 4.682 0.000 1.318 3.217
Patient_admitted_to_semi_intensive_unit_1=yes__0=no 0.8007 0.936 0.855 0.392 -1.034 2.635
Patient_admitted_to_intensive_care_unit_1=yes__0=no 2.8673 1.053 2.723 0.006 0.804 4.931
Hematocrit 4.2839 7.108 0.603 0.547 -9.648 18.216
Hemoglobin -3.5583 7.959 -0.447 0.655 -19.158 12.041
Mean_platelet_volume 0.1912 0.209 0.915 0.360 -0.218 0.601
Red_blood_Cells -0.9223 2.348 -0.393 0.695 -5.525 3.680
Lymphocytes -0.6403 0.291 -2.200 0.028 -1.211 -0.070
Mean_corpuscular_hemoglobin_concentration MCHC 0.4418 2.392 0.185 0.853 -4.246 5.130
Leukocytes -1.8201 0.403 -4.511 0.000 -2.611 -1.029
Mean_corpuscular_hemoglobin_MCH 1.2148 2.968 0.409 0.682 -4.602 7.032
Mean_corpuscular_volume_MCV -2.0011 3.148 -0.636 0.525 -8.171 4.169
Respiratory_Syncytial_Virus_detected -19.4941 9.37e+06 -2.08e-06 1.000 -1.84e+07 1.84e+07
Respiratory_Syncytial_Virus_not_detected 4.5391 9.36e+06 4.85e-07 1.000 -1.83e+07 1.83e+07
Influenza_A_detected -20.1388 4.61e+06 -4.37e-06 1.000 -9.03e+06 9.03e+06
Influenza_A_not_detected 5.1838 4.58e+06 1.13e-06 1.000 -8.97e+06 8.97e+06
Influenza_B_detected -8.2054 1.43e+06 -5.75e-06 1.000 -2.8e+06 2.8e+06
Influenza_B_not_detected -6.7496 1.49e+06 -4.52e-06 1.000 -2.93e+06 2.93e+06
Parainfluenza_1_detected -16.0470 nan nan nan nan nan
Parainfluenza_1_not_detected 1.0920 nan nan nan nan nan
CoronavirusNL63_detected -7.7202 1.34e+07 -5.77e-07 1.000 -2.62e+07 2.62e+07
CoronavirusNL63_not_detected -7.2348 1.33e+07 -5.42e-07 1.000 -2.62e+07 2.62e+07
Rhinovirus_Enterovirus_detected -8.7523 nan nan nan nan nan
Rhinovirus_Enterovirus_not_detected -6.2027 nan nan nan nan nan
Coronavirus_HKU1_detected -19.6165 4.03e+06 -4.87e-06 1.000 -7.9e+06 7.9e+06
Coronavirus_HKU1_not_detected 4.6615 4.02e+06 1.16e-06 1.000 -7.89e+06 7.89e+06
Parainfluenza_3_detected -17.3682 1.5e+07 -1.15e-06 1.000 -2.95e+07 2.95e+07
Parainfluenza_3_not_detected 2.4132 1.5e+07 1.6e-07 1.000 -2.95e+07 2.95e+07
Chlamydophila_pneumoniae_detected -24.9985 5.75e+07 -4.34e-07 1.000 -1.13e+08 1.13e+08
Chlamydophila_pneumoniae_not_detected 10.0435 1.77e+07 5.68e-07 1.000 -3.46e+07 3.46e+07
Adenovirus_detected -18.1330 nan nan nan nan nan
Adenovirus_not_detected 3.1780 nan nan nan nan nan
Parainfluenza_4_detected -16.2110 4.97e+06 -3.26e-06 1.000 -9.75e+06 9.75e+06
Parainfluenza_4_not_detected 1.2559 4.97e+06 2.53e-07 1.000 -9.75e+06 9.75e+06
Coronavirus229E_detected -14.1399 nan nan nan nan nan
Coronavirus229E_not_detected -0.8151 nan nan nan nan nan
CoronavirusOC43_detected -18.2074 6.48e+06 -2.81e-06 1.000 -1.27e+07 1.27e+07
CoronavirusOC43_not_detected 3.2524 6.47e+06 5.02e-07 1.000 -1.27e+07 1.27e+07
Inf_A_H1N1_2009_detected -17.4714 5.3e+06 -3.3e-06 1.000 -1.04e+07 1.04e+07
Inf_A_H1N1_2009_not_detected 2.5164 5.3e+06 4.75e-07 1.000 -1.04e+07 1.04e+07
Bordetella_pertussis_detected -15.0837 nan nan nan nan nan
Bordetella_pertussis_not_detected 0.1287 nan nan nan nan nan
Metapneumovirus_detected -12.6127 1.89e+07 -6.67e-07 1.000 -3.71e+07 3.71e+07
Metapneumovirus_not_detected -2.3423 1.89e+07 -1.24e-07 1.000 -3.71e+07 3.71e+07
Parainfluenza_2_not_detected -14.9550 2.33e+07 -6.41e-07 1.000 -4.57e+07 4.57e+07
Influenza_B__rapid_test_negative -1.3120 nan nan nan nan nan
Influenza_B__rapid_test_positive -21.2481 nan nan nan nan nan
Influenza_A__rapid_test_negative 1.1056 nan nan nan nan nan
Influenza_A__rapid_test_positive -23.6657 nan nan nan nan nan
=======================================================================================================================
/Users/kofori/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
# predicting on training set
# default threshold is 0.5, if predicted probability is greater than 0.5 the observation will be classified as 1
pred_train = lg.predict(X_train) > 0.5
pred_train = np.round(pred_train)
cm = confusion_matrix(y_train, pred_train)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
The confusion matrix above shows the model's performance on the training set.
# let's check the VIF of the predictors
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

const                                                    16.383
Patient_age_quantile                                      1.106
Patient_admitted_to_regular_ward_1=yes__0=no              1.175
Patient_admitted_to_semi_intensive_unit_1=yes__0=no       1.168
Patient_admitted_to_intensive_care_unit_1=yes__0=no       1.148
Hematocrit                                             1247.367
Hemoglobin                                             1362.359
Mean_platelet_volume                                      1.079
Red_blood_Cells                                          96.796
Lymphocytes                                               1.211
Mean_corpuscular_hemoglobin_concentration MCHC           85.543
Leukocytes                                                1.326
Mean_corpuscular_hemoglobin_MCH                         149.056
Mean_corpuscular_volume_MCV                             157.850
Respiratory_Syncytial_Virus_detected                        inf
Respiratory_Syncytial_Virus_not_detected                    inf
Influenza_A_detected                                        inf
Influenza_A_not_detected                                    inf
Influenza_B_detected                                        inf
Influenza_B_not_detected                                    inf
Parainfluenza_1_detected                                    inf
Parainfluenza_1_not_detected                                inf
CoronavirusNL63_detected                                    inf
CoronavirusNL63_not_detected                                inf
Rhinovirus_Enterovirus_detected                             inf
Rhinovirus_Enterovirus_not_detected                         inf
Coronavirus_HKU1_detected                                   inf
Coronavirus_HKU1_not_detected                               inf
Parainfluenza_3_detected                                    inf
Parainfluenza_3_not_detected                                inf
Chlamydophila_pneumoniae_detected                           inf
Chlamydophila_pneumoniae_not_detected                       inf
Adenovirus_detected                                         inf
Adenovirus_not_detected                                     inf
Parainfluenza_4_detected                                    inf
Parainfluenza_4_not_detected                                inf
Coronavirus229E_detected                                    inf
Coronavirus229E_not_detected                                inf
CoronavirusOC43_detected                                    inf
CoronavirusOC43_not_detected                                inf
Inf_A_H1N1_2009_detected                                    inf
Inf_A_H1N1_2009_not_detected                                inf
Bordetella_pertussis_detected                               inf
Bordetella_pertussis_not_detected                           inf
Metapneumovirus_detected                                    inf
Metapneumovirus_not_detected                                inf
Parainfluenza_2_not_detected                                inf
Influenza_B__rapid_test_negative                            inf
Influenza_B__rapid_test_positive                            inf
Influenza_A__rapid_test_negative                            inf
Influenza_A__rapid_test_positive                            inf
dtype: float64
Features with VIF greater than 5 are Hematocrit (1247.4), Hemoglobin (1362.4), Red_blood_Cells (96.8), Mean_corpuscular_hemoglobin_concentration MCHC (85.5), Mean_corpuscular_hemoglobin_MCH (149.1), and Mean_corpuscular_volume_MCV (157.9), along with every one-hot encoded viral test column, whose VIF is infinite because those dummies are exactly linearly dependent. We begin by dropping Hematocrit and refitting.
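For intuition, VIF is simply 1/(1 - R²) from regressing one predictor on all the others, which is why near-duplicates such as Hematocrit and Hemoglobin reach values in the hundreds and the exactly dependent dummy pairs go to infinity. A numpy-only sketch of the computation on hypothetical data:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    others = np.column_stack([np.ones(len(y)), others])  # intercept term
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# Column 1 is a noisy near-copy of column 0; column 2 is independent
rng = np.random.default_rng(1)
a = rng.normal(size=200)
X = np.column_stack([a, a + rng.normal(scale=0.05, size=200), rng.normal(size=200)])

vif_with_duplicate = vif(X, 0)                    # very large
vif_after_drop = vif(np.delete(X, 1, axis=1), 0)  # close to 1
print(round(vif_with_duplicate, 1), round(vif_after_drop, 2))
```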
X_train2 = X_train.drop(["Hematocrit"], axis=1)
logit = sm.Logit(y_train, X_train2.astype(float))
lg = logit.fit(
disp=False
) # setting disp=False will remove the information on number of iterations
print(lg.summary())
Logit Regression Results
==================================================================================
Dep. Variable: SARS_Cov_2_exam_result No. Observations: 2567
Model: Logit Df Residuals: 2534
Method: MLE Df Model: 32
Date: Sat, 11 Feb 2023 Pseudo R-squ.: -0.3301
Time: 01:00:10 Log-Likelihood: -1102.1
converged: False LL-Null: -828.54
Covariance Type: nonrobust LLR p-value: 1.000
=======================================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------------
const -2.9533 0.344 -8.583 0.000 -3.628 -2.279
Patient_age_quantile 0.0365 0.013 2.915 0.004 0.012 0.061
Patient_admitted_to_regular_ward_1=yes__0=no 2.2434 0.485 4.630 0.000 1.294 3.193
Patient_admitted_to_semi_intensive_unit_1=yes__0=no 0.7503 0.942 0.796 0.426 -1.096 2.597
Patient_admitted_to_intensive_care_unit_1=yes__0=no 2.7938 1.035 2.699 0.007 0.765 4.823
Hemoglobin 0.9961 2.366 0.421 0.674 -3.641 5.633
Mean_platelet_volume 0.1944 0.209 0.930 0.352 -0.215 0.604
Red_blood_Cells -0.9173 2.470 -0.371 0.710 -5.759 3.924
Lymphocytes -0.6542 0.291 -2.250 0.024 -1.224 -0.084
Mean_corpuscular_hemoglobin_concentration MCHC -0.6593 1.602 -0.412 0.681 -3.799 2.481
Leukocytes -1.7881 0.400 -4.469 0.000 -2.572 -1.004
Mean_corpuscular_hemoglobin_MCH 1.0910 2.990 0.365 0.715 -4.769 6.951
Mean_corpuscular_volume_MCV -1.8656 3.237 -0.576 0.564 -8.209 4.478
Respiratory_Syncytial_Virus_detected -18.1605 4.88e+06 -3.72e-06 1.000 -9.57e+06 9.57e+06
Respiratory_Syncytial_Virus_not_detected -3.9729 4.88e+06 -8.15e-07 1.000 -9.56e+06 9.56e+06
Influenza_A_detected -21.8960 nan nan nan nan nan
Influenza_A_not_detected -0.2374 nan nan nan nan nan
Influenza_B_detected -11.7725 nan nan nan nan nan
Influenza_B_not_detected -10.3609 nan nan nan nan nan
Parainfluenza_1_detected -140.6339 5.03e+56 -2.79e-55 1.000 -9.86e+56 9.86e+56
Parainfluenza_1_not_detected 118.5004 5.05e+06 2.35e-05 1.000 -9.89e+06 9.89e+06
CoronavirusNL63_detected -11.2947 nan nan nan nan nan
CoronavirusNL63_not_detected -10.8387 nan nan nan nan nan
Rhinovirus_Enterovirus_detected -12.3511 2.28e+06 -5.42e-06 1.000 -4.46e+06 4.46e+06
Rhinovirus_Enterovirus_not_detected -9.7823 2.28e+06 -4.29e-06 1.000 -4.46e+06 4.46e+06
Coronavirus_HKU1_detected -14.1479 4.51e+06 -3.14e-06 1.000 -8.83e+06 8.83e+06
Coronavirus_HKU1_not_detected -7.9855 4.51e+06 -1.77e-06 1.000 -8.83e+06 8.83e+06
Parainfluenza_3_detected -21.6833 1.03e+06 -2.1e-05 1.000 -2.02e+06 2.02e+06
Parainfluenza_3_not_detected -0.4501 1.03e+06 -4.37e-07 1.000 -2.02e+06 2.02e+06
Chlamydophila_pneumoniae_detected -14.6920 nan nan nan nan nan
Chlamydophila_pneumoniae_not_detected -7.4414 nan nan nan nan nan
Adenovirus_detected -17.7638 nan nan nan nan nan
Adenovirus_not_detected -4.3696 nan nan nan nan nan
Parainfluenza_4_detected 9.9741 nan nan nan nan nan
Parainfluenza_4_not_detected -32.1075 nan nan nan nan nan
Coronavirus229E_detected -15.6418 4.25e+06 -3.68e-06 1.000 -8.33e+06 8.33e+06
Coronavirus229E_not_detected -6.4916 4.25e+06 -1.53e-06 1.000 -8.33e+06 8.33e+06
CoronavirusOC43_detected -22.0646 7.03e+06 -3.14e-06 1.000 -1.38e+07 1.38e+07
CoronavirusOC43_not_detected -0.0688 7.03e+06 -9.79e-09 1.000 -1.38e+07 1.38e+07
Inf_A_H1N1_2009_detected -29.9233 6.34e+07 -4.72e-07 1.000 -1.24e+08 1.24e+08
Inf_A_H1N1_2009_not_detected 7.7899 2.25e+07 3.47e-07 1.000 -4.4e+07 4.4e+07
Bordetella_pertussis_detected -15.0491 3.79e+06 -3.97e-06 1.000 -7.43e+06 7.43e+06
Bordetella_pertussis_not_detected -7.0843 3.79e+06 -1.87e-06 1.000 -7.43e+06 7.43e+06
Metapneumovirus_detected -19.1391 3.83e+06 -5e-06 1.000 -7.5e+06 7.5e+06
Metapneumovirus_not_detected -2.9943 3.83e+06 -7.83e-07 1.000 -7.5e+06 7.5e+06
Parainfluenza_2_not_detected -22.1334 1.72e+07 -1.28e-06 1.000 -3.38e+07 3.38e+07
Influenza_B__rapid_test_negative 11.4975 8.48e+06 1.36e-06 1.000 -1.66e+07 1.66e+07
Influenza_B__rapid_test_positive -44.6537 1.26e+12 -3.56e-11 1.000 -2.46e+12 2.46e+12
Influenza_A__rapid_test_negative -11.6972 8.48e+06 -1.38e-06 1.000 -1.66e+07 1.66e+07
Influenza_A__rapid_test_positive -21.4589 8.48e+06 -2.53e-06 1.000 -1.66e+07 1.66e+07
=======================================================================================================================
# let's check the VIF of the predictors
vif_series = pd.Series(
[variance_inflation_factor(X_train2.values, i) for i in range(X_train2.shape[1])],
index=X_train2.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

const                                                    16.331
Patient_age_quantile                                      1.104
Patient_admitted_to_regular_ward_1=yes__0=no              1.168
Patient_admitted_to_semi_intensive_unit_1=yes__0=no       1.164
Patient_admitted_to_intensive_care_unit_1=yes__0=no       1.146
Hemoglobin                                               79.025
Mean_platelet_volume                                      1.071
Red_blood_Cells                                          90.763
Lymphocytes                                               1.196
Mean_corpuscular_hemoglobin_concentration MCHC           30.896
Leukocytes                                                1.324
Mean_corpuscular_hemoglobin_MCH                         143.537
Mean_corpuscular_volume_MCV                             145.782
Respiratory_Syncytial_Virus_detected                        inf
Respiratory_Syncytial_Virus_not_detected                    inf
Influenza_A_detected                                        inf
Influenza_A_not_detected                                    inf
Influenza_B_detected                                        inf
Influenza_B_not_detected                                    inf
Parainfluenza_1_detected                                    inf
Parainfluenza_1_not_detected                                inf
CoronavirusNL63_detected                                    inf
CoronavirusNL63_not_detected                                inf
Rhinovirus_Enterovirus_detected                             inf
Rhinovirus_Enterovirus_not_detected                         inf
Coronavirus_HKU1_detected                                   inf
Coronavirus_HKU1_not_detected                               inf
Parainfluenza_3_detected                                    inf
Parainfluenza_3_not_detected                                inf
Chlamydophila_pneumoniae_detected                           inf
Chlamydophila_pneumoniae_not_detected                       inf
Adenovirus_detected                                         inf
Adenovirus_not_detected                                     inf
Parainfluenza_4_detected                                    inf
Parainfluenza_4_not_detected                                inf
Coronavirus229E_detected                                    inf
Coronavirus229E_not_detected                                inf
CoronavirusOC43_detected                                    inf
CoronavirusOC43_not_detected                                inf
Inf_A_H1N1_2009_detected                                    inf
Inf_A_H1N1_2009_not_detected                                inf
Bordetella_pertussis_detected                               inf
Bordetella_pertussis_not_detected                           inf
Metapneumovirus_detected                                    inf
Metapneumovirus_not_detected                                inf
Parainfluenza_2_not_detected                                inf
Influenza_B__rapid_test_negative                            inf
Influenza_B__rapid_test_positive                            inf
Influenza_A__rapid_test_negative                            inf
Influenza_A__rapid_test_positive                            inf
dtype: float64
### Dropping the remaining highly collinear numeric variables (Red_blood_Cells, Hemoglobin, and Mean_corpuscular_hemoglobin_MCH)
X_train3 = X_train2.drop(["Red_blood_Cells", "Hemoglobin","Mean_corpuscular_hemoglobin_MCH"], axis=1)
logit = sm.Logit(y_train, X_train3.astype(float))
lg = logit.fit(
disp=False
) # setting disp=False will remove the information on number of iterations
print(lg.summary())
Logit Regression Results
==================================================================================
Dep. Variable: SARS_Cov_2_exam_result No. Observations: 2567
Model: Logit Df Residuals: 2537
Method: MLE Df Model: 29
Date: Sat, 11 Feb 2023 Pseudo R-squ.: 0.09159
Time: 01:00:11 Log-Likelihood: -752.65
converged: False LL-Null: -828.54
Covariance Type: nonrobust LLR p-value: 1.397e-18
=======================================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------------
const -2.8293 0.173 -16.309 0.000 -3.169 -2.489
Patient_age_quantile 0.0363 0.012 2.920 0.004 0.012 0.061
Patient_admitted_to_regular_ward_1=yes__0=no 2.1344 0.444 4.811 0.000 1.265 3.004
Patient_admitted_to_semi_intensive_unit_1=yes__0=no 0.7016 0.904 0.776 0.438 -1.070 2.473
Patient_admitted_to_intensive_care_unit_1=yes__0=no 2.6826 1.008 2.660 0.008 0.706 4.659
Mean_platelet_volume 0.1808 0.199 0.909 0.363 -0.209 0.571
Lymphocytes -0.6457 0.280 -2.308 0.021 -1.194 -0.097
Mean_corpuscular_hemoglobin_concentration MCHC 0.1107 0.233 0.475 0.635 -0.346 0.568
Leukocytes -1.7292 0.375 -4.615 0.000 -2.463 -0.995
Mean_corpuscular_volume_MCV -0.4449 0.243 -1.833 0.067 -0.921 0.031
Respiratory_Syncytial_Virus_detected -17.0293 nan nan nan nan nan
Respiratory_Syncytial_Virus_not_detected 4.4555 nan nan nan nan nan
Influenza_A_detected -14.4232 1.15e+07 -1.25e-06 1.000 -2.25e+07 2.25e+07
Influenza_A_not_detected 1.8493 1.15e+07 1.61e-07 1.000 -2.25e+07 2.25e+07
Influenza_B_detected -7.0000 nan nan nan nan nan
Influenza_B_not_detected -5.5738 nan nan nan nan nan
Parainfluenza_1_detected -13.8470 1.06e+07 -1.3e-06 1.000 -2.08e+07 2.08e+07
Parainfluenza_1_not_detected 1.2732 1.07e+07 1.19e-07 1.000 -2.09e+07 2.09e+07
CoronavirusNL63_detected -6.5220 nan nan nan nan nan
CoronavirusNL63_not_detected -6.0518 nan nan nan nan nan
Rhinovirus_Enterovirus_detected -7.5791 nan nan nan nan nan
Rhinovirus_Enterovirus_not_detected -4.9947 nan nan nan nan nan
Coronavirus_HKU1_detected -12.6400 nan nan nan nan nan
Coronavirus_HKU1_not_detected 0.0662 nan nan nan nan nan
Parainfluenza_3_detected -13.7939 1.96e+07 -7.04e-07 1.000 -3.84e+07 3.84e+07
Parainfluenza_3_not_detected 1.2200 1.96e+07 6.23e-08 1.000 -3.84e+07 3.84e+07
Chlamydophila_pneumoniae_detected -15.9385 2.34e+07 -6.81e-07 1.000 -4.59e+07 4.59e+07
Chlamydophila_pneumoniae_not_detected 3.3647 2.34e+07 1.44e-07 1.000 -4.59e+07 4.59e+07
Adenovirus_detected -12.1188 nan nan nan nan nan
Adenovirus_not_detected -0.4550 nan nan nan nan nan
Parainfluenza_4_detected -12.9639 1.53e+07 -8.47e-07 1.000 -3e+07 3e+07
Parainfluenza_4_not_detected 0.3901 1.53e+07 2.55e-08 1.000 -3e+07 3e+07
Coronavirus229E_detected -16.1049 7.92e+06 -2.03e-06 1.000 -1.55e+07 1.55e+07
Coronavirus229E_not_detected 3.5311 7.92e+06 4.46e-07 1.000 -1.55e+07 1.55e+07
CoronavirusOC43_detected -15.4096 nan nan nan nan nan
CoronavirusOC43_not_detected 2.8358 nan nan nan nan nan
Inf_A_H1N1_2009_detected -17.9507 6.19e+06 -2.9e-06 1.000 -1.21e+07 1.21e+07
Inf_A_H1N1_2009_not_detected 5.3769 6.19e+06 8.68e-07 1.000 -1.21e+07 1.21e+07
Bordetella_pertussis_detected -14.4267 nan nan nan nan nan
Bordetella_pertussis_not_detected 1.8529 nan nan nan nan nan
Metapneumovirus_detected -15.9689 nan nan nan nan nan
Metapneumovirus_not_detected 3.3951 nan nan nan nan nan
Parainfluenza_2_not_detected -12.5738 nan nan nan nan nan
Influenza_B__rapid_test_negative -5.5340 6.72e+07 -8.23e-08 1.000 -1.32e+08 1.32e+08
Influenza_B__rapid_test_positive -7.6429 6.72e+07 -1.14e-07 1.000 -1.32e+08 1.32e+08
Influenza_A__rapid_test_negative 5.3317 6.72e+07 7.93e-08 1.000 -1.32e+08 1.32e+08
Influenza_A__rapid_test_positive -18.5085 6.72e+07 -2.75e-07 1.000 -1.32e+08 1.32e+08
=======================================================================================================================
/Users/kofori/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
# let's check the VIF of the predictors
vif_series = pd.Series(
[variance_inflation_factor(X_train3.values, i) for i in range(X_train3.shape[1])],
index=X_train3.columns,
dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

const                                                  5.222
Patient_age_quantile                                   1.087
Patient_admitted_to_regular_ward_1=yes__0=no           1.059
Patient_admitted_to_semi_intensive_unit_1=yes__0=no    1.116
Patient_admitted_to_intensive_care_unit_1=yes__0=no    1.100
Mean_platelet_volume                                   1.057
Lymphocytes                                            1.165
Mean_corpuscular_hemoglobin_concentration MCHC         1.026
Leukocytes                                             1.272
Mean_corpuscular_volume_MCV                            1.029
Respiratory_Syncytial_Virus_detected                     inf
Respiratory_Syncytial_Virus_not_detected                 inf
Influenza_A_detected                                     inf
Influenza_A_not_detected                                 inf
Influenza_B_detected                                     inf
Influenza_B_not_detected                                 inf
Parainfluenza_1_detected                                 inf
Parainfluenza_1_not_detected                             inf
CoronavirusNL63_detected                                 inf
CoronavirusNL63_not_detected                             inf
Rhinovirus_Enterovirus_detected                          inf
Rhinovirus_Enterovirus_not_detected                      inf
Coronavirus_HKU1_detected                                inf
Coronavirus_HKU1_not_detected                            inf
Parainfluenza_3_detected                                 inf
Parainfluenza_3_not_detected                             inf
Chlamydophila_pneumoniae_detected                        inf
Chlamydophila_pneumoniae_not_detected                    inf
Adenovirus_detected                                      inf
Adenovirus_not_detected                                  inf
Parainfluenza_4_detected                                 inf
Parainfluenza_4_not_detected                             inf
Coronavirus229E_detected                                 inf
Coronavirus229E_not_detected                             inf
CoronavirusOC43_detected                                 inf
CoronavirusOC43_not_detected                             inf
Inf_A_H1N1_2009_detected                                 inf
Inf_A_H1N1_2009_not_detected                             inf
Bordetella_pertussis_detected                            inf
Bordetella_pertussis_not_detected                        inf
Metapneumovirus_detected                                 inf
Metapneumovirus_not_detected                             inf
Parainfluenza_2_not_detected                             inf
Influenza_B__rapid_test_negative                         inf
Influenza_B__rapid_test_positive                         inf
Influenza_A__rapid_test_negative                         inf
Influenza_A__rapid_test_positive                         inf
dtype: float64
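The infinite VIFs arise because each `_detected`/`_not_detected` dummy pair always sums to one, which together with the constant term is perfect multicollinearity. A minimal sketch of avoiding this with `drop_first=True` (the column name `Influenza_A` here is illustrative, not taken from the dataset):

```python
import pandas as pd

# A "_detected"/"_not_detected" dummy pair always sums to 1, which together
# with the intercept is perfectly collinear, producing infinite VIFs.
df = pd.DataFrame({"Influenza_A": ["detected", "not_detected", "not_detected"]})

# Keeping both levels produces two perfectly collinear columns...
both = pd.get_dummies(df)

# ...while drop_first=True keeps a single column per test, removing the
# redundancy so the VIFs for these predictors become finite.
one = pd.get_dummies(df, drop_first=True)
print(one.columns.tolist())  # ['Influenza_A_not_detected']
```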
# converting coefficients to odds
odds = np.exp(lg.params)
# finding the percentage change
perc_change_odds = (np.exp(lg.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train3.columns).T
| const | Patient_age_quantile | Patient_admitted_to_regular_ward_1=yes__0=no | Patient_admitted_to_semi_intensive_unit_1=yes__0=no | Patient_admitted_to_intensive_care_unit_1=yes__0=no | Mean_platelet_volume | Lymphocytes | Mean_corpuscular_hemoglobin_concentration MCHC | Leukocytes | Mean_corpuscular_volume_MCV | Respiratory_Syncytial_Virus_detected | Respiratory_Syncytial_Virus_not_detected | Influenza_A_detected | Influenza_A_not_detected | Influenza_B_detected | Influenza_B_not_detected | Parainfluenza_1_detected | Parainfluenza_1_not_detected | CoronavirusNL63_detected | CoronavirusNL63_not_detected | Rhinovirus_Enterovirus_detected | Rhinovirus_Enterovirus_not_detected | Coronavirus_HKU1_detected | Coronavirus_HKU1_not_detected | Parainfluenza_3_detected | Parainfluenza_3_not_detected | Chlamydophila_pneumoniae_detected | Chlamydophila_pneumoniae_not_detected | Adenovirus_detected | Adenovirus_not_detected | Parainfluenza_4_detected | Parainfluenza_4_not_detected | Coronavirus229E_detected | Coronavirus229E_not_detected | CoronavirusOC43_detected | CoronavirusOC43_not_detected | Inf_A_H1N1_2009_detected | Inf_A_H1N1_2009_not_detected | Bordetella_pertussis_detected | Bordetella_pertussis_not_detected | Metapneumovirus_detected | Metapneumovirus_not_detected | Parainfluenza_2_not_detected | Influenza_B__rapid_test_negative | Influenza_B__rapid_test_positive | Influenza_A__rapid_test_negative | Influenza_A__rapid_test_positive | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.059 | 1.037 | 8.452 | 2.017 | 14.623 | 1.198 | 0.524 | 1.117 | 0.177 | 0.641 | 0.000 | 86.097 | 0.000 | 6.356 | 0.001 | 0.004 | 0.000 | 3.572 | 0.001 | 0.002 | 0.001 | 0.007 | 0.000 | 1.068 | 0.000 | 3.387 | 0.000 | 28.924 | 0.000 | 0.634 | 0.000 | 1.477 | 0.000 | 34.163 | 0.000 | 17.045 | 0.000 | 216.355 | 0.000 | 6.378 | 0.000 | 29.817 | 0.000 | 0.004 | 0.000 | 206.793 | 0.000 |
| Change_odd% | -94.094 | 3.702 | 745.212 | 101.693 | 1362.333 | 19.820 | -47.568 | 11.706 | -82.257 | -35.911 | -100.000 | 8509.704 | -100.000 | 535.565 | -99.909 | -99.620 | -100.000 | 257.242 | -99.853 | -99.765 | -99.949 | -99.323 | -100.000 | 6.839 | -100.000 | 238.734 | -100.000 | 2792.426 | -99.999 | -36.554 | -100.000 | 47.710 | -100.000 | 3316.288 | -100.000 | 1604.452 | -100.000 | 21535.479 | -100.000 | 537.845 | -100.000 | 2881.663 | -100.000 | -99.605 | -99.952 | 20579.273 | -100.000 |
# creating confusion matrix
confusion_matrix_statsmodels(lg, X_train3, y_train)
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg, X_train3, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.905 | 0.067 | 0.739 | 0.123 |
Although the accuracy and precision of the model are fairly good, recall is poor (6.7%).
logit_roc_auc_train = roc_auc_score(y_train, lg.predict(X_train3))
fpr, tpr, thresholds = roc_curve(y_train, lg.predict(X_train3))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
y_scores = lg.predict(X_train3)
prec, rec, tre = precision_recall_curve(y_train, y_scores)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
At a threshold of about 0.16, the precision and recall curves intersect (precision = recall).
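Rather than reading the crossover off the plot, it can be located programmatically from the arrays returned by `precision_recall_curve`. A minimal sketch with illustrative labels and scores standing in for `y_train` and the model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative labels and predicted probabilities (not the notebook's data)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.05, 0.1, 0.12, 0.2, 0.15, 0.3, 0.4, 0.7, 0.35, 0.8])

prec, rec, thr = precision_recall_curve(y_true, y_scores)

# precision/recall have one extra trailing entry relative to thresholds,
# so drop it before aligning; the crossover is where |precision - recall|
# is smallest.
idx = np.argmin(np.abs(prec[:-1] - rec[:-1]))
crossover = thr[idx]
print(float(crossover))
```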
# setting the threshold
optimal_threshold_curve = 0.16
# creating confusion matrix
confusion_matrix_statsmodels(lg, X_train3, y_train, threshold=optimal_threshold_curve)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg, X_train3, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.898 | 0.142 | 0.444 | 0.215 |
Using the optimal threshold of 0.16, recall improves from 6.7% to 14.2% and F1 rises from 0.123 to 0.215, at the cost of precision (down from 73.9% to 44.4%).
X_val_3 = X_val[X_train3.columns].astype(float)
# creating confusion matrix
confusion_matrix_statsmodels(lg, X_val_3, y_val, threshold=optimal_threshold_curve)
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg, X_val_3, y_val, threshold=optimal_threshold_curve
)
print("Validation set performance:")
log_reg_model_test_perf
Validation set performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.892 | 0.131 | 0.375 | 0.195 |
X_test3 = X_test[X_train3.columns].astype(float)
# creating confusion matrix
confusion_matrix_statsmodels(lg, X_test3, y_test)
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg, X_test3, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.901 | 0.162 | 0.500 | 0.244 |
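The thresholding applied above can be shown in isolation: statsmodels' `predict` returns probabilities, and a label is assigned by comparing each probability to the cutoff (the probability values below are illustrative):

```python
import numpy as np

# Illustrative predicted probabilities, as returned by lg.predict(...)
probs = np.array([0.05, 0.12, 0.18, 0.40, 0.90])

# Any probability at or above the cutoff is classified as positive
threshold = 0.16
labels = (probs >= threshold).astype(int)
print(labels.tolist())  # [0, 0, 1, 1, 1]
```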
print(lg.summary())
Logit Regression Results
==================================================================================
Dep. Variable: SARS_Cov_2_exam_result No. Observations: 2567
Model: Logit Df Residuals: 2537
Method: MLE Df Model: 29
Date: Sat, 11 Feb 2023 Pseudo R-squ.: 0.09159
Time: 01:00:13 Log-Likelihood: -752.65
converged: False LL-Null: -828.54
Covariance Type: nonrobust LLR p-value: 1.397e-18
=======================================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------------------
const -2.8293 0.173 -16.309 0.000 -3.169 -2.489
Patient_age_quantile 0.0363 0.012 2.920 0.004 0.012 0.061
Patient_admitted_to_regular_ward_1=yes__0=no 2.1344 0.444 4.811 0.000 1.265 3.004
Patient_admitted_to_semi_intensive_unit_1=yes__0=no 0.7016 0.904 0.776 0.438 -1.070 2.473
Patient_admitted_to_intensive_care_unit_1=yes__0=no 2.6826 1.008 2.660 0.008 0.706 4.659
Mean_platelet_volume 0.1808 0.199 0.909 0.363 -0.209 0.571
Lymphocytes -0.6457 0.280 -2.308 0.021 -1.194 -0.097
Mean_corpuscular_hemoglobin_concentration MCHC 0.1107 0.233 0.475 0.635 -0.346 0.568
Leukocytes -1.7292 0.375 -4.615 0.000 -2.463 -0.995
Mean_corpuscular_volume_MCV -0.4449 0.243 -1.833 0.067 -0.921 0.031
Respiratory_Syncytial_Virus_detected -17.0293 nan nan nan nan nan
Respiratory_Syncytial_Virus_not_detected 4.4555 nan nan nan nan nan
Influenza_A_detected -14.4232 1.15e+07 -1.25e-06 1.000 -2.25e+07 2.25e+07
Influenza_A_not_detected 1.8493 1.15e+07 1.61e-07 1.000 -2.25e+07 2.25e+07
Influenza_B_detected -7.0000 nan nan nan nan nan
Influenza_B_not_detected -5.5738 nan nan nan nan nan
Parainfluenza_1_detected -13.8470 1.06e+07 -1.3e-06 1.000 -2.08e+07 2.08e+07
Parainfluenza_1_not_detected 1.2732 1.07e+07 1.19e-07 1.000 -2.09e+07 2.09e+07
CoronavirusNL63_detected -6.5220 nan nan nan nan nan
CoronavirusNL63_not_detected -6.0518 nan nan nan nan nan
Rhinovirus_Enterovirus_detected -7.5791 nan nan nan nan nan
Rhinovirus_Enterovirus_not_detected -4.9947 nan nan nan nan nan
Coronavirus_HKU1_detected -12.6400 nan nan nan nan nan
Coronavirus_HKU1_not_detected 0.0662 nan nan nan nan nan
Parainfluenza_3_detected -13.7939 1.96e+07 -7.04e-07 1.000 -3.84e+07 3.84e+07
Parainfluenza_3_not_detected 1.2200 1.96e+07 6.23e-08 1.000 -3.84e+07 3.84e+07
Chlamydophila_pneumoniae_detected -15.9385 2.34e+07 -6.81e-07 1.000 -4.59e+07 4.59e+07
Chlamydophila_pneumoniae_not_detected 3.3647 2.34e+07 1.44e-07 1.000 -4.59e+07 4.59e+07
Adenovirus_detected -12.1188 nan nan nan nan nan
Adenovirus_not_detected -0.4550 nan nan nan nan nan
Parainfluenza_4_detected -12.9639 1.53e+07 -8.47e-07 1.000 -3e+07 3e+07
Parainfluenza_4_not_detected 0.3901 1.53e+07 2.55e-08 1.000 -3e+07 3e+07
Coronavirus229E_detected -16.1049 7.92e+06 -2.03e-06 1.000 -1.55e+07 1.55e+07
Coronavirus229E_not_detected 3.5311 7.92e+06 4.46e-07 1.000 -1.55e+07 1.55e+07
CoronavirusOC43_detected -15.4096 nan nan nan nan nan
CoronavirusOC43_not_detected 2.8358 nan nan nan nan nan
Inf_A_H1N1_2009_detected -17.9507 6.19e+06 -2.9e-06 1.000 -1.21e+07 1.21e+07
Inf_A_H1N1_2009_not_detected 5.3769 6.19e+06 8.68e-07 1.000 -1.21e+07 1.21e+07
Bordetella_pertussis_detected -14.4267 nan nan nan nan nan
Bordetella_pertussis_not_detected 1.8529 nan nan nan nan nan
Metapneumovirus_detected -15.9689 nan nan nan nan nan
Metapneumovirus_not_detected 3.3951 nan nan nan nan nan
Parainfluenza_2_not_detected -12.5738 nan nan nan nan nan
Influenza_B__rapid_test_negative -5.5340 6.72e+07 -8.23e-08 1.000 -1.32e+08 1.32e+08
Influenza_B__rapid_test_positive -7.6429 6.72e+07 -1.14e-07 1.000 -1.32e+08 1.32e+08
Influenza_A__rapid_test_negative 5.3317 6.72e+07 7.93e-08 1.000 -1.32e+08 1.32e+08
Influenza_A__rapid_test_positive -18.5085 6.72e+07 -2.75e-07 1.000 -1.32e+08 1.32e+08
=======================================================================================================================
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
## Function to create confusion matrix
def make_confusion_matrix(model,y_actual,labels=[1, 0]):
'''
model : classifier to predict values of X
y_actual : ground truth
Note: predictions are made on the global X_test, so pass the matching y_test as y_actual
'''
y_predict = model.predict(X_test)
cm=metrics.confusion_matrix( y_actual, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model,flag=True):
'''
model : classifier to predict values of X
'''
# defining an empty list to store train and test results
score_list=[]
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
train_acc = model.score(X_train,y_train)
test_acc = model.score(X_test,y_test)
train_recall = metrics.recall_score(y_train,pred_train)
test_recall = metrics.recall_score(y_test,pred_test)
train_precision = metrics.precision_score(y_train,pred_train)
test_precision = metrics.precision_score(y_test,pred_test)
score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision))
# If the flag is set to True then only the following print statements will be displayed. The default value is set to True.
if flag == True:
print("Accuracy on training set : ",model.score(X_train,y_train))
print("Accuracy on test set : ",model.score(X_test,y_test))
print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
print("Precision on training set : ",metrics.precision_score(y_train,pred_train))
print("Precision on test set : ",metrics.precision_score(y_test,pred_test))
return score_list # returning the list with train and test scores
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.918 | 0.169 | 1.000 | 0.290 |
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.897 | 0.084 | 0.400 | 0.139 |
In the test set, recall (8.4%) and precision (40%) fall well below the training values, indicating that the unconstrained decision tree is overfitting.
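A large train/test gap like this is the classic signature of an unconstrained tree memorizing the training data. A common check is to limit tree growth and compare the gap; a minimal sketch on synthetic imbalanced data (the notebook's `X_train`/`y_train` are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's imbalanced data
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=1)

# An unconstrained tree typically fits the training set (almost) perfectly...
full = DecisionTreeClassifier(random_state=1).fit(Xtr, ytr)

# ...while depth/leaf constraints trade training accuracy for a smaller
# train/test gap.
pruned = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=20, random_state=1
).fit(Xtr, ytr)

print(full.score(Xtr, ytr), full.score(Xte, yte))
print(pruned.score(Xtr, ytr), pruned.score(Xte, yte))
```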
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The five most important features can be read from the top of the plot above.
#base_estimator for bagging classifier is a decision tree by default
bagging_estimator=BaggingClassifier(random_state=1)
bagging_estimator.fit(X_train,y_train)
BaggingClassifier(random_state=1)
make_confusion_matrix(bagging_estimator,y_test)
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_estimator_score=get_metrics_score(bagging_estimator)
Accuracy on training set : 0.9170237631476431
Accuracy on test set : 0.9020070838252656
Recall on training set : 0.16929133858267717
Recall on test set : 0.07784431137724551
Precision on training set : 0.9555555555555556
Precision on test set : 0.52
#Train the random forest classifier
rf_estimator=RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
make_confusion_matrix(rf_estimator,y_test)
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_score=get_metrics_score(rf_estimator)
Accuracy on training set : 0.917802882742501
Accuracy on test set : 0.9061393152302243
Recall on training set : 0.17716535433070865
Recall on test set : 0.059880239520958084
Precision on training set : 0.9574468085106383
Precision on test set : 0.8333333333333334
# Choose the type of classifier.
bagging_estimator_tuned = BaggingClassifier(random_state=1)
# Grid of parameters to choose from
## add from article
parameters = {'max_samples': [0.7,0.8,0.9,1],
'max_features': [0.7,0.8,0.9,1],
'n_estimators' : [10,20,30,40,50],
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=40,
random_state=1)
make_confusion_matrix(bagging_estimator_tuned, y_test)
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_estimator_tuned_score=get_metrics_score(bagging_estimator_tuned)
Accuracy on training set : 0.917802882742501
Accuracy on test set : 0.9020070838252656
Recall on training set : 0.16929133858267717
Recall on test set : 0.0658682634730539
Precision on training set : 1.0
Precision on test set : 0.5238095238095238
The bagging classifier's performance is essentially unchanged after hyperparameter tuning.
bagging_lr=BaggingClassifier(base_estimator=LogisticRegression(solver='liblinear',random_state=1,max_iter=1000),random_state=1)
bagging_lr.fit(X_train,y_train)
BaggingClassifier(base_estimator=LogisticRegression(max_iter=1000,
random_state=1,
solver='liblinear'),
random_state=1)
make_confusion_matrix(bagging_lr,y_test)
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_lr_score=get_metrics_score(bagging_lr)
Accuracy on training set : 0.9045578496299181
Accuracy on test set : 0.9037780401416765
Recall on training set : 0.051181102362204724
Recall on test set : 0.059880239520958084
Precision on training set : 0.7647058823529411
Precision on test set : 0.625
# Choose the type of classifier.
rf_estimator_tuned = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
## add from article
parameters = {"n_estimators": [15,26,5],
"min_samples_leaf": np.arange(5, 10),
"max_features": ['sqrt', 'log2'],
"max_samples": np.arange(5, 10, 5),
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(max_features='sqrt', max_samples=5, min_samples_leaf=5,
n_estimators=15, random_state=1)
make_confusion_matrix(rf_estimator_tuned,y_test)
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_tuned_score=get_metrics_score(rf_estimator_tuned)
Accuracy on training set : 0.901051811453058
Accuracy on test set : 0.9014167650531287
Recall on training set : 0.0
Recall on test set : 0.0
Precision on training set : 0.0
Precision on test set : 0.0
Although this model is not overfitting, it fails to identify any positive cases (precision and recall are both 0). Note that `max_samples` in the random forest grid takes absolute counts, so `max_samples=5` trained each tree on only five rows, which explains the degenerate fit.
rf_wt = RandomForestClassifier(class_weight={0:0.4,1:0.6}, random_state=1)
rf_wt.fit(X_train,y_train)
RandomForestClassifier(class_weight={0: 0.4, 1: 0.6}, random_state=1)
confusion_matrix_sklearn(rf_wt, X_test,y_test)
rf_wt_model_train_perf=model_performance_classification_sklearn(rf_wt, X_train,y_train)
print("Training performance \n",rf_wt_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.917 0.185 0.904 0.307
rf_wt_model_test_perf=model_performance_classification_sklearn(rf_wt, X_test,y_test)
print("Testing performance \n",rf_wt_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.904 0.048 0.667 0.089
Relative to the tuned random forest, precision and recall have recovered in the class-weighted model, although test recall (4.8%) remains very low.
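Instead of hand-picking weights like `{0: 0.4, 1: 0.6}`, scikit-learn can derive weights inversely proportional to class frequency. A minimal sketch using the training-set class counts reported later (254 positives vs. 2313 negatives); `y_demo` and `rf_balanced` are illustrative names:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# With 254 positives out of 2567 training rows, "balanced" weighting
# up-weights the minority class far more aggressively than {0: 0.4, 1: 0.6}.
y_demo = np.array([0] * 2313 + [1] * 254)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], np.round(weights, 3))))

# The same rule can be passed directly to the forest
rf_balanced = RandomForestClassifier(class_weight="balanced", random_state=1)
```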
importances = rf_wt.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
The top five features in the class-weighted random forest model can be read from the plot above.
Next, boosting is applied using AdaBoost and gradient boosting classifiers.
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train,y_train)
AdaBoostClassifier(random_state=1)
make_confusion_matrix(abc,y_test)
#Code to determine accuracy, recall and precision on train and test set
abc_score=get_metrics_score(abc)
Accuracy on training set : 0.9139072847682119
Accuracy on test set : 0.9031877213695395
Recall on training set : 0.14173228346456693
Recall on test set : 0.0718562874251497
Precision on training set : 0.9230769230769231
Precision on test set : 0.5714285714285714
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train,y_train)
GradientBoostingClassifier(random_state=1)
make_confusion_matrix(gbc,y_test)
#Determining accuracy, recall and precision on train and test set
gbc_score=get_metrics_score(gbc)
Accuracy on training set : 0.9170237631476431
Accuracy on test set : 0.9014167650531287
Recall on training set : 0.16535433070866143
Recall on test set : 0.07784431137724551
Precision on training set : 0.9767441860465116
Precision on test set : 0.5
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
## add from article
parameters = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1, random_state=1),DecisionTreeClassifier(max_depth=2, random_state=1),DecisionTreeClassifier(max_depth=3, random_state=1)],
"n_estimators": np.arange(15,26,5),
"learning_rate":np.arange(0.1,2,0.1)
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.8, n_estimators=15, random_state=1)
make_confusion_matrix(abc_tuned,y_test)
#Using above defined function to get accuracy, recall and precision on train and test set
abc_tuned_score=get_metrics_score(abc_tuned)
Accuracy on training set : 0.9170237631476431
Accuracy on test set : 0.9020070838252656
Recall on training set : 0.16535433070866143
Recall on test set : 0.04790419161676647
Precision on training set : 0.9767441860465116
Precision on test set : 0.5333333333333333
#Using AdaBoost classifier as the estimator for initial predictions
gbc_init = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
gbc_init.fit(X_train,y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
random_state=1)
gbc_init_score=get_metrics_score(gbc_init)
Accuracy on training set : 0.9166342033502143
Accuracy on test set : 0.9020070838252656
Recall on training set : 0.16141732283464566
Recall on test set : 0.07784431137724551
Precision on training set : 0.9761904761904762
Precision on test set : 0.52
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
## add from article
parameters = {
"n_estimators": [15,26,5],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.9, n_estimators=26, random_state=1,
subsample=1)
make_confusion_matrix(gbc_tuned,y_test)
#Accuracy, recall and precision on train and test set
gbc_tuned_score=get_metrics_score(gbc_tuned)
Accuracy on training set : 0.9123490455784963
Accuracy on test set : 0.9025974025974026
Recall on training set : 0.11811023622047244
Recall on test set : 0.0718562874251497
Precision on training set : 0.967741935483871
Precision on test set : 0.5454545454545454
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
The top five features after hyperparameter tuning can be read from the plot above.
# defining list of models
models = [bagging_estimator,bagging_estimator_tuned,bagging_lr,rf_estimator,rf_estimator_tuned,
rf_wt]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the accuracy, recall and precision scores
for model in models:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
comparison_frame = pd.DataFrame({'Model':['Bagging classifier with default parameters','Tuned Bagging Classifier',
'Bagging classifier with base_estimator=LR', 'Random Forest with default parameters',
'Tuned Random Forest Classifier','Random Forest with class_weights'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test})
comparison_frame
| Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | |
|---|---|---|---|---|---|---|---|
| 0 | Bagging classifier with default parameters | 0.920 | 0.900 | 0.170 | 0.080 | 0.960 | 0.520 |
| 1 | Tuned Bagging Classifier | 0.920 | 0.900 | 0.170 | 0.070 | 1.000 | 0.520 |
| 2 | Bagging classifier with base_estimator=LR | 0.900 | 0.900 | 0.050 | 0.060 | 0.760 | 0.620 |
| 3 | Random Forest with default parameters | 0.920 | 0.910 | 0.180 | 0.060 | 0.960 | 0.830 |
| 4 | Tuned Random Forest Classifier | 0.900 | 0.900 | 0.000 | 0.000 | 0.000 | 0.000 |
| 5 | Random Forest with class_weights | 0.920 | 0.900 | 0.190 | 0.050 | 0.900 | 0.670 |
# defining list of models
models = [abc, abc_tuned, gbc, gbc_init, gbc_tuned]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the accuracy, recall and precision scores
for model in models:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
comparison_frame = pd.DataFrame({'Model':['AdaBoost with default parameters','AdaBoost Tuned',
'Gradient Boosting with default parameters','Gradient Boosting with init=AdaBoost',
'Gradient Boosting Tuned'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test})
comparison_frame
| Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | |
|---|---|---|---|---|---|---|---|
| 0 | AdaBoost with default parameters | 0.910 | 0.900 | 0.140 | 0.070 | 0.920 | 0.570 |
| 1 | AdaBoost Tuned | 0.920 | 0.900 | 0.170 | 0.050 | 0.980 | 0.530 |
| 2 | Gradient Boosting with default parameters | 0.920 | 0.900 | 0.170 | 0.080 | 0.980 | 0.500 |
| 3 | Gradient Boosting with init=AdaBoost | 0.920 | 0.900 | 0.160 | 0.080 | 0.980 | 0.520 |
| 4 | Gradient Boosting Tuned | 0.910 | 0.900 | 0.120 | 0.070 | 0.970 | 0.550 |
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, count of label '1': 254
Before OverSampling, count of label '0': 2313 

After OverSampling, count of label '1': 2313
After OverSampling, count of label '0': 2313 

After OverSampling, the shape of train_X: (4626, 51)
After OverSampling, the shape of train_y: (4626,) 
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Training Performance:" "\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_train_over, model.predict(X_train_over)) * 100
print("{}: {}".format(name, scores))
Cross-Validation Performance:

Bagging: 85.82059409273232
Random forest: 87.72114854188288
GBM: 89.5360578945892
Adaboost: 86.42226024515442
dtree: 85.69053696483502

Training Performance:

Bagging: 88.49978383052313
Random forest: 89.06182447038478
GBM: 92.95287505404237
Adaboost: 87.8945092952875
dtree: 88.84565499351491
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
The gradient boosting model (GBM) is the best-performing model on the oversampled data.
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, count of label '0': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, count of label '1': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, count of label '0': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, count of label '1': 254
Before Under Sampling, count of label '0': 2313

After Under Sampling, count of label '1': 254
After Under Sampling, count of label '0': 254

After Under Sampling, the shape of train_X: (508, 51)
After Under Sampling, the shape of train_y: (508,)
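RandomUnderSampler discards majority rows at random until the classes match; nothing is synthesized, which is why the training set shrinks to 508 rows above. A pure-Python sketch of the idea (the `random_undersample` helper is illustrative only, not part of imblearn):

```python
import random

def random_undersample(X, y, minority_label, rng):
    """Keep every minority row and a random, equally sized subset of
    the majority rows (sampling_strategy=1 in imblearn terms)."""
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    majority_idx = [i for i, label in enumerate(y) if label != minority_label]
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    kept = sorted(minority_idx + kept_majority)
    return [X[i] for i in kept], [y[i] for i in kept]

rng = random.Random(1)
X_toy = [[i] for i in range(10)]
y_toy = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # 2 minority vs 8 majority rows
X_bal, y_bal = random_undersample(X_toy, y_toy, minority_label=1, rng=rng)
```

The trade-off is information loss: here eight majority rows shrink to two, just as 2313 negatives shrink to 254 above.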
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Training Performance:" "\n")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_train_un, model.predict(X_train_un)) * 100
print("{}: {}".format(name, scores))
Cross-Validation Performance:

Bagging: 70.47058823529412
Random forest: 69.65490196078431
GBM: 76.7686274509804
Adaboost: 79.12941176470589
dtree: 65.31764705882352

Training Performance:

Bagging: 75.59055118110236
Random forest: 80.70866141732283
GBM: 85.43307086614173
Adaboost: 92.91338582677166
dtree: 77.55905511811024
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
AdaBoost is the best-performing model on the undersampled data.
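All of the tuning below scores on recall, i.e. the share of true Covid-19 positives the model actually flags, since a missed positive case is costlier here than a false alarm. A quick pure-Python reminder of the formula, using illustrative counts:

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): share of actual positives caught."""
    return tp / (tp + fn)

# e.g. 36 positives caught and 14 positives missed
recall(36, 14)  # 0.72
```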
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"max_depth": np.arange(2, 6),
"min_samples_leaf": [1, 4, 7],
"max_leaf_nodes": [10, 15],
"min_impurity_decrease": [0.0001, 0.001],
}
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
estimator=Model,
param_distributions=param_grid,
n_iter=10,
n_jobs=-1,
scoring=scorer,
cv=5,
random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print(
"Best parameters are {} with CV score={}:".format(
randomized_cv.best_params_, randomized_cv.best_score_
)
)
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 2} with CV score=0.9982712032388058:
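One way to avoid transcribing the tuned values by hand is to unpack `best_params_` directly into the estimator's constructor. A short sketch using the values reported by the search above:

```python
from sklearn.tree import DecisionTreeClassifier

# Parameters reported by the randomized search above
best_params = {"min_samples_leaf": 7, "min_impurity_decrease": 0.0001,
               "max_leaf_nodes": 10, "max_depth": 2}
# Unpacking the dict avoids re-typing (and mistyping) each value
dt_best = DecisionTreeClassifier(random_state=1, **best_params)
```

Equivalently, `randomized_cv.best_estimator_` already holds a fitted copy of this model.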
# Building the tree with the chosen parameters (note: these differ from the best_params_ reported above)
dt_tuned = DecisionTreeClassifier(
max_depth=4, min_samples_leaf=1, max_leaf_nodes=15, min_impurity_decrease=0.001,
)
# Fit the best algorithm to the data.
dt_tuned.fit(X_train_over, y_train_over)
DecisionTreeClassifier(max_depth=4, max_leaf_nodes=15,
min_impurity_decrease=0.001)
# creating confusion matrix
confusion_matrix_sklearn(dt_tuned, X_train_over, y_train_over)
# Calculating different metrics on train set
dt_random_train = model_performance_classification_sklearn(
dt_tuned, X_train_over, y_train_over
)
print("Training performance:")
dt_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.628 | 0.961 | 0.577 | 0.721 |
Recall has improved in this model.
%%time
# Choose the type of classifier.
rf2 = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {"n_estimators": [200, 250, 300],
              "min_samples_leaf": np.arange(1, 4),
              # unpack the arange so each fraction is a separate candidate
              "max_features": [*np.arange(0.3, 0.6, 0.1), 'sqrt'],
              "max_samples": np.arange(0.4, 0.7, 0.1),
              "max_depth": np.arange(3, 4),  # a single candidate: 3
              "class_weight": ['balanced', 'balanced_subsample'],
              "min_impurity_decrease": [0.001, 0.002, 0.003]
             }
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the random search
grid_obj = RandomizedSearchCV(rf2, parameters,n_iter=30, scoring=acc_scorer,cv=5, random_state = 1, n_jobs = -1, verbose = 2)
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10
grid_obj = grid_obj.fit(X_train_over, y_train_over)
# Print the best combination of parameters
grid_obj.best_params_
Fitting 5 folds for each of 30 candidates, totalling 150 fits
CPU times: user 1.55 s, sys: 82.1 ms, total: 1.63 s
Wall time: 34.9 s
{'n_estimators': 250,
'min_samples_leaf': 1,
'min_impurity_decrease': 0.001,
'max_samples': 0.6,
'max_features': 'sqrt',
'max_depth': 3,
'class_weight': 'balanced_subsample'}
# Building the random forest with the chosen parameters (note: some values differ from best_params_ above)
rf2_tuned = RandomForestClassifier(
class_weight="balanced",
max_features="sqrt",
max_samples=0.5,
min_samples_leaf=2,
n_estimators=200,
random_state=1,
max_depth=3,
min_impurity_decrease=0.001,
)
# Fit the best algorithm to the data.
rf2_tuned.fit(X_train_over, y_train_over)
RandomForestClassifier(class_weight='balanced', max_depth=3,
max_features='sqrt', max_samples=0.5,
min_impurity_decrease=0.001, min_samples_leaf=2,
n_estimators=200, random_state=1)
# creating confusion matrix
confusion_matrix_sklearn(rf2_tuned, X_train_over, y_train_over)
# Calculating different metrics on train set
rf2_random_train = model_performance_classification_sklearn(
rf2_tuned, X_train_over, y_train_over
)
print("Training performance:")
rf2_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.649 | 0.889 | 0.601 | 0.717 |
Recall on the training set has improved for the random forest.
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 150, 200),  # step exceeds the range, so only 100 is tried
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=1, random_state=1)} with CV score=0.9719072863781287:
CPU times: user 632 ms, sys: 24.6 ms, total: 656 ms
Wall time: 11.9 s
# Building the model with the chosen parameters (base_estimator depth differs from the search result above)
adb_tuned2 = AdaBoostClassifier(
n_estimators=100,
learning_rate=0.05,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
adb_tuned2.fit(X_train_over, y_train_over)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.05, n_estimators=100, random_state=1)
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2, X_train_over, y_train_over)
# Calculating different metrics on train set
Adaboost_random_train = model_performance_classification_sklearn(
adb_tuned2, X_train_over, y_train_over
)
print("Training performance:")
Adaboost_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.704 | 0.899 | 0.647 | 0.752 |
Recall has also improved in this model.
# Choose the type of classifier.
gbc_tuned_1= GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [15,26,5],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned_1, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train_over, y_train_over)
# Set the clf to the best combination of parameters
gbc_tuned_1 = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned_1.fit(X_train_over, y_train_over)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.8, n_estimators=5, random_state=1,
subsample=0.9)
# creating confusion matrix
confusion_matrix_sklearn(gbc_tuned_1, X_train_over, y_train_over)
gbc_random_train1 = model_performance_classification_sklearn(
gbc_tuned_1, X_train_over, y_train_over
)
print("Training performance:")
gbc_random_train1
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.648 | 0.975 | 0.589 | 0.735 |
# Building the decision tree with the chosen parameters for the undersampled data
dt1_tuned = DecisionTreeClassifier(
max_depth=6, min_samples_leaf=7, max_leaf_nodes=15, min_impurity_decrease=0.001,
)
# Fit the best algorithm to the data.
dt1_tuned.fit(X_train_un, y_train_un)
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=15,
min_impurity_decrease=0.001, min_samples_leaf=7)
# creating confusion matrix
confusion_matrix_sklearn(dt1_tuned, X_train_un, y_train_un)
# Calculating different metrics on validation set
dt1_random_train = model_performance_classification_sklearn(dt1_tuned, X_train_un, y_train_un)
print("Training performance:")
dt1_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.634 | 0.874 | 0.590 | 0.705 |
%%time
# Choose the type of classifier.
rf2 = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {"n_estimators": [200, 250, 300],
              "min_samples_leaf": np.arange(1, 4),
              # unpack the arange so each fraction is a separate candidate
              "max_features": [*np.arange(0.3, 0.6, 0.1), 'sqrt'],
              "max_samples": np.arange(0.4, 0.7, 0.1),
              "max_depth": np.arange(3, 4),  # a single candidate: 3
              "class_weight": ['balanced', 'balanced_subsample'],
              "min_impurity_decrease": [0.001, 0.002, 0.003]
             }
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the random search
grid_obj = RandomizedSearchCV(rf2, parameters,n_iter=30, scoring=acc_scorer,cv=5, random_state = 1, n_jobs = -1, verbose = 2)
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10
grid_obj = grid_obj.fit(X_train_un, y_train_un)
# Print the best combination of parameters
grid_obj.best_params_
Fitting 5 folds for each of 30 candidates, totalling 150 fits
CPU times: user 746 ms, sys: 44.7 ms, total: 790 ms
Wall time: 27.8 s
{'n_estimators': 200,
'min_samples_leaf': 2,
'min_impurity_decrease': 0.001,
'max_samples': 0.5,
'max_features': 'sqrt',
'max_depth': 3,
'class_weight': 'balanced'}
# Building the random forest with the chosen parameters (note: some values differ from best_params_ above)
rf2_tuned = RandomForestClassifier(
class_weight="balanced",
max_features="sqrt",
max_samples=0.6,
min_samples_leaf=1,
n_estimators=200,
random_state=1,
max_depth=3,
min_impurity_decrease=0.001,
)
# Fit the best algorithm to the data.
rf2_tuned.fit(X_train_un, y_train_un)
RandomForestClassifier(class_weight='balanced', max_depth=3,
max_features='sqrt', max_samples=0.6,
min_impurity_decrease=0.001, n_estimators=200,
random_state=1)
# creating confusion matrix
confusion_matrix_sklearn(rf2_tuned, X_train_un, y_train_un)
# Calculating different metrics on train set
rf2_random_train = model_performance_classification_sklearn(
rf2_tuned, X_train_un, y_train_un
)
print("Training performance:")
rf2_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.646 | 0.858 | 0.602 | 0.708 |
%%time
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(100, 150, 200),  # step exceeds the range, so only 100 is tried
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=1, random_state=1)} with CV score=0.8936470588235295:
CPU times: user 258 ms, sys: 11.5 ms, total: 270 ms
Wall time: 3.49 s
# Building the model with the chosen parameters (learning_rate differs from the search result above)
adb_tuned2 = AdaBoostClassifier(
n_estimators=100,
learning_rate=0.2,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
adb_tuned2.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=100, random_state=1)
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2, X_train_un, y_train_un)
Adaboost_random_train = model_performance_classification_sklearn(
adb_tuned2, X_train_un, y_train_un
)
print("Training performance:")
Adaboost_random_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.742 | 0.819 | 0.710 | 0.761 |
The models achieve noticeably better recall and precision on the undersampled data than on the original, imbalanced data.
# Choose the type of classifier.
gbc_tuned_2= GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [15,26,5],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned_2, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_un, y_train_un)
# Set the clf to the best combination of parameters
gbc_tuned_2 = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned_2.fit(X_train_un, y_train_un)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.7, n_estimators=5, random_state=1,
subsample=0.9)
# creating confusion matrix
confusion_matrix_sklearn(gbc_tuned_2, X_train_un, y_train_un)
gbc_random_train2 = model_performance_classification_sklearn(
gbc_tuned_2, X_train_un, y_train_un
)
print("Training performance:")
gbc_random_train2
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.661 | 0.933 | 0.605 | 0.734 |
The tuned GBM again trades precision for a high recall (0.933) on the undersampled training data.
# training performance comparison
# NOTE: rf2_random_train and Adaboost_random_train were overwritten by the
# undersampled runs, so the "oversampled" columns below repeat those values
models_train_comp_df = pd.concat(
[
dt_random_train.T,
rf2_random_train.T,
Adaboost_random_train.T,
dt1_random_train.T,
rf2_random_train.T,
Adaboost_random_train.T,
gbc_random_train1.T,
gbc_random_train2.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Tuned DTree oversampled",
"Random forest Oversampled",
"AdaBoost Tuned with Random search",
"Tuned DTree undersampled",
"Random forest undersampled",
"Adaboost tuned with Random Search undersampled",
"GBM tuned with oversampled data",
"GBM tuned with undersampled data"
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Tuned DTree oversampled | Random forest Oversampled | AdaBoost Tuned with Random search | Tuned DTree undersampled | Random forest undersampled | Adaboost tuned with Random Search undersampled | GBM tuned with oversampled data | GBM tuned with undersampled data |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.628 | 0.646 | 0.742 | 0.634 | 0.646 | 0.742 | 0.648 | 0.661 |
| Recall | 0.961 | 0.858 | 0.819 | 0.874 | 0.858 | 0.819 | 0.975 | 0.933 |
| Precision | 0.577 | 0.602 | 0.710 | 0.590 | 0.602 | 0.710 | 0.589 | 0.605 |
| F1 | 0.721 | 0.708 | 0.761 | 0.705 | 0.708 | 0.761 | 0.735 | 0.734 |
The tuned gradient boosting model has the best recall on both the oversampled and undersampled training sets.
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_un, y_train_un)
scores = recall_score(y_val, model.predict(X_val)) * 100
print("{}: {}".format(name, scores))
Cross-Validation Performance:

Logistic regression: 63.37254901960784
Bagging: 70.47058823529412
Random forest: 69.65490196078431
GBM: 76.7686274509804
Adaboost: 79.12941176470589
dtree: 65.31764705882352

Validation Performance:

Logistic regression: 63.503649635036496
Bagging: 62.04379562043796
Random forest: 66.42335766423358
GBM: 73.72262773722628
Adaboost: 86.86131386861314
dtree: 56.934306569343065
# Calculating different metrics on validation set
dt1_random_val = model_performance_classification_sklearn(dt1_tuned, X_val, y_val)
print("Validation performance:")
dt1_random_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.408 | 0.832 | 0.125 | 0.218 |
# Calculating different metrics on validation set
rf2_random_val = model_performance_classification_sklearn(
rf2_tuned, X_val, y_val
)
print("Validation performance:")
rf2_random_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.415 | 0.810 | 0.124 | 0.215 |
# Calculating different metrics on validation set
Adaboost_random_val = model_performance_classification_sklearn(
adb_tuned2, X_val, y_val
)
print("Validation performance:")
Adaboost_random_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.524 | 0.635 | 0.125 | 0.209 |
gbc_random_val = model_performance_classification_sklearn(
gbc_tuned_2, X_val, y_val
)
print("Validation performance:")
gbc_random_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.390 | 0.905 | 0.130 | 0.227 |
# validation performance comparison
models_val_comp_df = pd.concat(
[
dt1_random_val.T,
rf2_random_val.T,
Adaboost_random_val.T,
gbc_random_val.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Tuned DTree val",
"Random forest val",
"AdaBoost Tuned with Random search val",
"GBM tuned val"
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| | Tuned DTree val | Random forest val | AdaBoost Tuned with Random search val | GBM tuned val |
|---|---|---|---|---|
| Accuracy | 0.408 | 0.415 | 0.524 | 0.390 |
| Recall | 0.832 | 0.810 | 0.635 | 0.905 |
| Precision | 0.125 | 0.124 | 0.125 | 0.130 |
| F1 | 0.218 | 0.215 | 0.209 | 0.227 |
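Rather than eyeballing the comparison frame, the recall row can be queried directly with `idxmax`. A sketch using the validation recalls reported above:

```python
import pandas as pd

# Validation recalls from the comparison table above
val_recall = pd.Series({"Tuned DTree val": 0.832,
                        "Random forest val": 0.810,
                        "AdaBoost Tuned with Random search val": 0.635,
                        "GBM tuned val": 0.905})
best_model = val_recall.idxmax()  # column label with the highest recall
```

In the full comparison frame the same lookup would be `models_val_comp_df.loc["Recall"].idxmax()`.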
# Calculating different metrics on test set
dt1_random_test = model_performance_classification_sklearn(dt1_tuned, X_test, y_test)
print("Test performance:")
dt1_random_test
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.402 | 0.868 | 0.128 | 0.223 |
# Calculating different metrics on test set
rf2_random_test = model_performance_classification_sklearn(
rf2_tuned, X_test, y_test
)
print("Test performance:")
rf2_random_test
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.416 | 0.826 | 0.126 | 0.218 |
# Calculating different metrics on test set
Adaboost_random_test = model_performance_classification_sklearn(
adb_tuned2, X_test, y_test
)
print("Test performance:")
Adaboost_random_test
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.537 | 0.677 | 0.134 | 0.224 |
gbc_random_test = model_performance_classification_sklearn(
gbc_tuned_2, X_test, y_test
)
print("Test performance:")
gbc_random_test
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.384 | 0.856 | 0.123 | 0.215 |
# test performance comparison
models_test_comp_df = pd.concat(
[
dt1_random_test.T,
rf2_random_test.T,
Adaboost_random_test.T,
gbc_random_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Tuned DTree test",
"Random forest test",
"AdaBoost Tuned with Random search test",
"GBM tuned test"
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Tuned DTree test | Random forest test | AdaBoost Tuned with Random search test | GBM tuned test |
|---|---|---|---|---|
| Accuracy | 0.402 | 0.416 | 0.537 | 0.384 |
| Recall | 0.868 | 0.826 | 0.677 | 0.856 |
| Precision | 0.128 | 0.126 | 0.134 | 0.123 |
| F1 | 0.223 | 0.218 | 0.224 | 0.215 |
feature_names = X.columns
importances = dt1_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
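The plotted importances can also be ranked as a labelled table, which is easier to read off than a long bar chart. A minimal sketch on synthetic data (the `X_demo`/`y_demo` arrays and feature names are illustrative only; in the notebook the equivalent would use `dt1_tuned.feature_importances_` and `X.columns`):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic example: only the first column carries signal
rng = np.random.RandomState(1)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)
tree = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
# Ranked importances as a labelled Series, highest first
imp = pd.Series(tree.feature_importances_,
                index=["f0", "f1", "f2"]).sort_values(ascending=False)
```

Tree importances always sum to 1, so the Series doubles as a sanity check on the plot.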
max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 1.0s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 1.0s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time= 1.0s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 1.1s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time= 0.9s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, 
min_samples_leaf=2, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 1.4s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 1.7s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time= 2.0s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time= 0.8s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time= 0.8s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=300; total time= 1.7s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=300; total time= 1.7s [CV] END class_weight=balanced, 
max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 1.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 2.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=200; total time= 1.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=200; total time= 1.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 1.3s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=300; total time= 1.4s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 1.0s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.8s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 1.3s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 1.3s 
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 1.6s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time= 1.9s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.9s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time= 1.5s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 2.4s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 1.6s [CV] END class_weight=balanced_subsample, max_depth=3, 
max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time= 1.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time= 0.6s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time= 0.8s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, 
min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 1.0s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 1.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 0.4s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 1.0s [CV] END 
class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=300; total time= 1.0s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.7s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 1.0s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 1.0s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time= 1.0s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 1.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, 
min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time= 0.9s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 1.5s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time= 2.1s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time= 1.9s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time= 0.9s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time= 0.9s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] 
END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=300; total time= 1.7s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 1.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 2.2s [CV] END class_weight=balanced_subsample, max_depth=3, 
max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 1.8s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 1.3s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=300; total time= 1.6s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.9s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 1.3s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 1.4s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, 
n_estimators=250; total time= 0.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 0.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time= 1.8s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.5s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.9s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time= 
0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time= 1.4s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 2.4s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time= 1.6s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 1.1s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time= 0.7s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time= 0.8s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time= 0.7s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, 
min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 1.0s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time= 1.0s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time= 1.2s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=200; total time= 1.1s [CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 0.3s [CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time= 0.2s [CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time= 1.2s [CV] 
[Verbose cross-validation log truncated. Each `[CV] END ...` entry records one fold of the randomized search over: class_weight ∈ {balanced, balanced_subsample}; max_depth = 3; max_features ∈ {sqrt, [0.3 0.4 0.5]}; max_samples ∈ {0.4, 0.5, 0.6}; min_impurity_decrease ∈ {0.001, 0.002, 0.003}; min_samples_leaf ∈ {1, 2, 3}; n_estimators ∈ {200, 250, 300}. Per-fold fit times ranged from roughly 0.2 s to 2.4 s.]
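A minimal sketch of the kind of randomized search that produces verbose fold logs like those above, using scikit-learn's `RandomizedSearchCV`. The parameter values are read off the log entries; the synthetic `make_classification` data, the `n_iter` and `cv` settings, and the expansion of the log's `[0.3 0.4 0.5]` array into separate float candidates are all illustrative assumptions, not the notebook's exact configuration.

```python
# Sketch of the randomized hyperparameter search behind the log above.
# Synthetic imbalanced data stands in for the hospital's Covid-19 lab results.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.85, 0.15], random_state=1)

param_distributions = {
    "n_estimators": [200, 250, 300],
    "max_depth": [3],
    # The log's "[0.3 0.4 0.5]" entry is expanded into separate candidates.
    "max_features": ["sqrt", 0.3, 0.4, 0.5],
    "max_samples": [0.4, 0.5, 0.6],
    "min_impurity_decrease": [0.001, 0.002, 0.003],
    "min_samples_leaf": [1, 2, 3],
    "class_weight": ["balanced", "balanced_subsample"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=10,          # illustrative; the full run sampled more candidates
    scoring="recall",   # recall is the project's selection metric
    cv=3,
    verbose=2,          # emits the "[CV] END ...; total time=" lines seen above
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```

Fixing `max_depth=3` while sampling the remaining parameters keeps every candidate forest shallow, which trades a little training-set fit for better generalization on a small, imbalanced dataset.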
Hospitals such as the Hospital Israelita Albert Einstein of São Paulo, Brazil, are overwhelmed and under-resourced when identifying positive Covid-19 cases among patients with flu-like symptoms.
Patients present with a multitude of signs and symptoms that make it difficult to isolate Covid-19 cases.
Patient data are voluminous and varied, and contain errors and missing values.
Develop a predictive analytical model that can serve as a sensitive screening tool.
Identify the most important parameters that influence positive predictions.
Data cleaning and other pre-processing techniques extract salient, useful information from the dataset.
Performance metrics will be used to evaluate the various analytical models, and the model with the best recall will be selected.
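The recall-based selection described above can be sketched as follows: fit each candidate model, score it on a held-out split, and keep the one with the highest recall for the positive class. The candidate models and synthetic data here are illustrative assumptions, not the notebook's actual pipeline.

```python
# Hedged sketch of recall-based model selection on a held-out split.
# Recall is prioritized because a missed positive Covid-19 case (a false
# negative) is the costliest error for a screening tool.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative candidates; class_weight="balanced" counters the class imbalance.
candidates = {
    "decision_tree": DecisionTreeClassifier(
        max_depth=3, class_weight="balanced", random_state=0),
    "random_forest": RandomForestClassifier(
        n_estimators=200, max_depth=3, class_weight="balanced", random_state=0),
}

recalls = {name: recall_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
           for name, model in candidates.items()}
best_model = max(recalls, key=recalls.get)
print(best_model, recalls)
```

Stratifying the split keeps the positive-class proportion comparable in the training and test sets, so the recall scores are comparable across candidates.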
The hospital management should redistribute resources to ensure that the key features or parameters that influence the prediction of positive Covid-19 cases are captured.
Patient age and type of admission must be recorded at all times.
In the absence of PCR machines and other direct Covid-19 testing modalities, the equipment and reagents needed to test for leukocytes, hematocrit, red blood cell parameters, and RSV and Rhinovirus/Enterovirus infections must be supplied to the frontline areas of the hospital where patients are first encountered.
Benefits:
Cost:
Using predictive machine learning models to make business intelligence decisions is prudent for the hospital management: it will save lives, increase productivity, and enable the delivery of value-based healthcare.
The tuned decision tree model will be an effective screening tool for the identification of positive Covid-19 cases among patients with flu-like symptoms.
It will support hospitals in areas where it is impossible or impractical to test every patient for Covid-19 infection.